Learning Parities with Neural Networks

Amit Daniely
School of Computer Science, The Hebrew University, Israel. Google Research Tel Aviv.
amit.daniely@mail.huji.ac.il

Eran Malach
School of Computer Science, The Hebrew University, Israel.
eran.malach@mail.huji.ac.il

Abstract

In recent years we have seen a rapidly growing line of research showing learnability of various models via common neural-network algorithms. Yet, besides a few outliers, these results show learnability of models that can be learned using linear methods. Namely, such results show that learning neural networks with gradient descent is competitive with learning a linear classifier on top of a data-independent representation of the examples. This leaves much to be desired, as neural networks are far more successful than linear methods. Furthermore, on the more conceptual level, linear models don't seem to capture the "deepness" of deep networks. In this paper we make a step towards showing learnability of models that are inherently non-linear. We show that under certain distributions, sparse parities are learnable via gradient descent on depth-two networks. On the other hand, under the same distributions, these parities cannot be learned efficiently by linear methods.

1 Introduction

The remarkable success of neural networks has sparked great theoretical interest in understanding their behavior. Impressively, a large number of papers [4, 26, 10, 8, 6, 15, 11, 20, 2, 3, 7, 28, 25, 13, 21, 5, 7, 16, 19, 18, 9] have established polynomial-time learnability of various models by neural-network algorithms (i.e., gradient-based methods). Yet, to the best of our knowledge, with the single exception of learning one neuron [27], all these results prove learnability of linear models, namely models that can be realized by a linear classifier on top of a (possibly random) embedding that is fixed and does not depend on the data. This is not surprising, as the majority of these papers prove learnability via "linearization" of the network in the vicinity of the initial random weights.

While these results achieve remarkable progress in understanding neural networks, they are still disappointing in some sense. Indeed, in practice, the performance of neural networks is far better than that of linear methods, a fact that is not explained by these works. Moreover, learning a linear classifier on top of a fixed embedding seems to completely miss the "deepness" of deep learning. How far can neural-network theory go beyond linear models?

In this work we show a family of distributions on which neural networks trained with gradient descent achieve small error. On the other hand, approximating the same family using a linear classifier on top of an embedding of the input space in $\mathbb{R}^N$ requires $N$ which grows exponentially, or otherwise requires a linear classifier with exponential norm. Specifically, we focus on a standard and notoriously difficult family of target functions: parities over small subsets of the input bits. We show that this family is learnable with neural networks under some specific choice of distributions. This implies that neural-network algorithms are strictly stronger than linear methods, as the same family cannot be approximated by any polynomial-size linear model.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
1.1 Related Work

Recently, a few works have provided theoretical results demonstrating that neural networks are stronger than random features, a linear model where the embedding map is randomly drawn from some predefined distribution [22]. These works show problems that are easy to learn with neural networks, while being hard to learn with random features. The work of [27] shows that random features cannot approximate a distribution generated by a single neuron and Gaussian inputs, which is known to be learnable by a neural network. The work of [1] shows that neural networks are more efficient than random features, in terms of sample complexity and run-time, for some regression problems generated by a ResNet-like network. A work by [14] shows another family of distributions where neural networks with quadratic activation outperform random features, when the number of features is smaller than the dimension. Our result differs from these works in several aspects. First, [27] and [1] study the power of approximating regression problems, and hence their results cannot be applied to the setting of classification, which we cover in this work. Second, we give an exponential separation, while [1] and [14] only give a polynomial separation. Namely, the problems for which they show that neural networks perform better than linear methods are still poly-time learnable by linear methods.

2 Problem Setting

Let $X = \{\pm 1\}^n$ be the instance space, and $Y = \{\pm 1\}$ the label space. Since we focus on a binary classification task, we take the hinge-loss $\ell(y, \hat y) = \max\{1 - y\hat y, 0\}$ to be our primary loss function. For some distribution $D$ over $X \times Y$, and some function $h : X \to \mathbb{R}$, we define the loss of $h$ over the distribution to be:
$$L_D(h) = \mathbb{E}_{(x,y)\sim D}\left[\ell(y, h(x))\right]$$
Let $H$ be some class of functions from $X$ to $\mathbb{R}$. We define the loss of $H$ with respect to the distribution $D$ to be the loss of the best function in $H$:
$$L_D(H) = \min_{h \in H} L_D(h)$$
So, $L_D(H)$ measures whether $H$ can approximate the distribution $D$.

The Class F. Our target functions will be parities on $k$ bits of the input. Let $A \subseteq [n]$ be some subset of size $|A| = k$, for some odd $k \ge 3$, and define $f_A$ to be the parity of the bits in $A$, namely $f_A(x) = \operatorname{sign}(\prod_{i\in A} x_i)$. For every subset $A \subseteq [n]$, we construct a distribution on the instances $X$ that is easy to learn with neural networks. Let $D^{(1)}_A$ be the uniform distribution on $X$, and let $D^{(2)}_A$ be the distribution that is uniform on all the bits in $[n]\setminus A$, while the bits in $A$ are all $1$ w.p. $\frac{1}{2}$ and all $-1$ w.p. $\frac{1}{2}$. Let $D_A$ be the distribution over $X \times Y$ where we sample $x \sim D^{(1)}_A$ w.p. $\frac{1}{2}$ and $x \sim D^{(2)}_A$ w.p. $\frac{1}{2}$, and set $y = f_A(x)$. This defines a family of distributions $F = \{D_A : A \subseteq [n], |A| = k\}$.

Comment 1. Note that the above distributional assumptions are not typical. For example, it is more natural to assume that the underlying distribution is the uniform distribution. However, it is well known that learning parities under the uniform distribution is computationally hard for statistical-query algorithms [17] and gradient-based algorithms [24]. Therefore, we must take an easier distribution to make the problem learnable using gradient-descent.

The training algorithm. We train a neural network with gradient-descent on the distribution $D_A$. Let $g^{(t)} : X \to \mathbb{R}$ be our neural network at time $t$:
$$g^{(t)}(x) = \sum_{i=1}^{2q} u^{(t)}_i\, \sigma\!\left(\langle w^{(t)}_i, x\rangle + b^{(t)}_i\right)$$
where $u^{(t)}_i, b^{(t)}_i \in \mathbb{R}$, $w^{(t)}_i \in \mathbb{R}^n$ and $\sigma$ denotes the ReLU6 activation $\sigma(x) = \min(\max(x, 0), 6)$. Define a regularization term $R(g^{(t)}) = \|u^{(t)}\|^2 + \sum_{i=1}^{2q}\|w^{(t)}_i\|^2$, and recall the hinge-loss function $\ell(y, \hat y) = \max(1 - \hat y y, 0)$.
Then, the loss on the distribution is $L_D(g) = \mathbb{E}\left[\ell(y, g(x))\right]$, and we perform the following updates:
$$w^{(t)} = w^{(t-1)} - \eta_t \nabla_w\!\left(L_D(g^{(t-1)}) + \lambda_t R(g^{(t-1)})\right)$$
$$u^{(t)} = u^{(t-1)} - \eta_t \nabla_u\!\left(L_D(g^{(t-1)}) + \lambda_t R(g^{(t-1)})\right)$$
for some choice of $\eta_1, \ldots, \eta_T$ and $\lambda_1, \ldots, \lambda_T$. We assume the network is initialized with a symmetric initialization: for every $i \in [q]$, initialize $w^{(0)}_i \sim U(\{-1, 0, 1\}^n)$ and then set $w^{(0)}_{q+i} = w^{(0)}_i$, initialize $b^{(0)}_i = \frac{1}{8k}$ and set $b^{(0)}_{q+i} = b^{(0)}_i$, and initialize $u^{(0)}_i$ uniformly from a symmetric interval around zero (whose width depends on $n$ and $k$) and set $u^{(0)}_{q+i} = -u^{(0)}_i$, so that $g^{(0)} \equiv 0$.

3 Main Result

Our main result shows a separation between neural networks and any linear method (i.e., learning a linear classifier over some fixed embedding). This result is composed of two parts: first, we show that the family $F$ cannot be approximated by any polynomial-size linear method. Second, we show that neural networks can be trained to approximate the family $F$ using gradient-descent.

The following theorem implies that the class $F$ cannot be approximated by a linear classifier on top of a fixed embedding, unless the embedding dimension or the norm of the weights is exponential:

Theorem 2. Fix some $\psi : X \to [-1, 1]^N$, and define:
$$H_B = \left\{x \mapsto \langle \psi(x), w\rangle : \|w\|_2 \le B\right\}$$
Then, if $k \le \frac{n}{16}$, there exists some $D_A \in F$ such that:
$$L_{D_A}(H_B) \ge \frac{1}{2} - \frac{\sqrt{2N}\,B}{2^{k+1}}$$

The following result shows that neural networks can learn the family $F$ with gradient-descent. That is, for every distribution $D_A \in F$, a large enough neural network achieves a small error when trained with gradient-descent on the distribution $D_A$. Together with Theorem 2, it establishes an (exponential) separation between the class of distributions that can be learned with neural networks, and the class of distributions that can be learned by linear methods.

Theorem 3. Fix some $D_A \in F$. Assume we run gradient-descent for $T$ iterations, with $\eta_1 = 1$, $\lambda_1 = \frac{1}{2}$ and $\eta_t = \frac{k^2}{T\sqrt q}$, $\lambda_t \le \frac{k}{n}$ for every $t > 1$. Assume that $n \ge \Omega(1)$ and $7 \le k \le O(\sqrt[10]{n})$. Fix some $\delta > 0$, and assume that the number of neurons satisfies $q \ge \Omega\!\left(k^7\log\frac{k}{\delta}\right)$. Then, with probability at least $1 - \delta$ over the initialization, there exists $t \le T$ such that:
$$\Pr_{x \sim D_A}\left[g^{(t)}(x) \ne f_A(x)\right] \le O\!\left(\frac{\operatorname{poly}(k)}{\sqrt q} + \frac{qk}{\sqrt n} + \frac{k^2\sqrt q}{T}\right)$$

Observe that the above theorem immediately implies that neural networks of polynomial size, trained using gradient-descent for a polynomial number of steps, achieve small loss for every $D_A \in F$:

Corollary 4. Fix $\epsilon, \delta \in (0, 1/2)$. For every $D_A \in F$, for large enough $n$, running gradient-descent with the assumptions in Theorem 3 achieves, with probability at least $1 - \delta$, a loss of at most $\epsilon$, when training a network of size $q = \operatorname{poly}(k, \frac{1}{\epsilon}, \log\frac{1}{\delta})$ for $T = \operatorname{poly}(k, \frac{1}{\epsilon}, \log\frac{1}{\delta})$ iterations.

Proof. Choose $q = \Theta\!\left(\epsilon^{-1}k^{16}\log\frac{k}{\delta}\right)$, $T = \Theta\!\left(\epsilon^{-2}k^{10}\log\frac{k}{\delta}\right)$ and $n \ge \Theta\!\left(\epsilon^{-4}k^{34}\log\frac{k}{\delta}\right)$.

4 Proof of Theorem 2

Proof. Let $L_A(w) := L_{D^{(1)}_A}(\langle\psi(\cdot), w\rangle)$ and define the objective $G_A(w) := L_A(w) + \frac{\lambda}{2}\|w\|^2$. Observe that for every $i \in [N]$ we have:
$$\nabla_i G_A(0) = -\mathbb{E}_{x\sim D}\left[f_A(x)\,\psi_i(x)\right]$$
Since $\{f_A\}_{A\subseteq[n]}$ is a Fourier basis, we have:
$$\sum_{A\subseteq[n]} \mathbb{E}_{x\sim D}\left[f_A(x)\,\psi_i(x)\right]^2 = \|\psi_i\|^2 \le 1$$
And therefore:
$$\mathbb{E}_{A\subseteq[n],|A|=k}\left[\|\nabla G_A(0)\|^2\right] = \sum_{i=1}^N \mathbb{E}_{A\subseteq[n],|A|=k}\left[\mathbb{E}_{x\sim D}\left[f_A(x)\,\psi_i(x)\right]^2\right] \le \frac{N}{\binom{n}{k}} \le \frac{N}{(n/k)^k} \le \frac{N}{16^k} \quad (1)$$
where we use the fact that $(n/k)^k \ge 16^k$. Using Jensen's inequality we get:
$$\mathbb{E}_A\left[\|\nabla G_A(0)\|\right]^2 \le \mathbb{E}_A\left[\|\nabla G_A(0)\|^2\right] \le \frac{N}{16^k}, \quad \text{so } \mathbb{E}_A\left[\|\nabla G_A(0)\|\right] \le \frac{\sqrt N}{4^k}.$$
Note that $G_A$ is $\lambda$-strongly convex, and therefore, for every $w, u$ we have:
$$\langle \nabla G_A(w) - \nabla G_A(u), w - u\rangle \ge \lambda\|w - u\|^2$$
Let $w^\star_A := \arg\min_w G_A(w)$, and so $\nabla G_A(w^\star_A) = 0$.
Using the above we get:
$$\lambda\|w^\star_A\|^2 \le \langle\nabla G_A(w^\star_A) - \nabla G_A(0),\, w^\star_A - 0\rangle \le \|\nabla G_A(0)\|\,\|w^\star_A\|,$$
so $\|w^\star_A\| \le \frac{1}{\lambda}\|\nabla G_A(0)\|$. Now, notice that $L_A$ is $\sqrt N$-Lipschitz, since:
$$\|\nabla L_A(w)\| = \left\|\nabla\,\mathbb{E}\left[\ell(y, \langle\psi(x), w\rangle)\right]\right\| \le \mathbb{E}\left[|\ell'|\cdot\|\psi(x)\|\right] \le \sqrt N.$$
Therefore, we get that:
$$L_A(w^\star_A) \ge L_A(0) - \sqrt N\,\|w^\star_A\| \ge 1 - \frac{\sqrt N}{\lambda}\|\nabla G_A(0)\| \quad (2)$$
Denote $\hat w_A = \arg\min_{\|w\|\le B} L_A(w)$, and by the optimality of $w^\star_A$:
$$L_A(w^\star_A) \le L_A(w^\star_A) + \frac{\lambda}{2}\|w^\star_A\|^2 \le L_A(\hat w_A) + \frac{\lambda}{2}\|\hat w_A\|^2 \le L_A(\hat w_A) + \frac{\lambda B^2}{2} \quad (3)$$
From (2) and (3) we get:
$$1 - \frac{\sqrt N}{\lambda}\|\nabla G_A(0)\| \le L_A(w^\star_A) \le L_A(\hat w_A) + \frac{\lambda B^2}{2}$$
Taking an expectation and plugging in (1) we get:
$$\mathbb{E}_{A\subseteq[n],|A|=k}\left[L_A(\hat w_A)\right] \ge 1 - \frac{\sqrt N}{\lambda}\,\mathbb{E}_A\left[\|\nabla G_A(0)\|\right] - \frac{\lambda B^2}{2} \ge 1 - \frac{N}{\lambda\,4^k} - \frac{\lambda B^2}{2}$$
Since this is true for all $\lambda$, taking $\lambda = \frac{\sqrt{2N}}{2^k B}$ we get:
$$\mathbb{E}_{A\subseteq[n],|A|=k}\left[L_A(\hat w_A)\right] \ge 1 - \frac{\sqrt{2N}\,B}{2^k}$$
Therefore, there exists some $A \subseteq [n]$ with $|A| = k$ such that $\min_{h\in H_B} L_{D^{(1)}_A}(h) \ge 1 - \frac{\sqrt{2N}\,B}{2^k}$. Since $D_A = \frac{1}{2}D^{(1)}_A + \frac{1}{2}D^{(2)}_A$ and the hinge-loss is non-negative, we get the required.

5 Proof of Theorem 3

We start by giving a rough sketch of the proof of Theorem 3. We divide the proof into two steps:

First gradient step. We show that after the first gradient step, there is a subset of good neurons in the first layer that approximately implement the functions $\phi_j(x) := \sigma\!\left(\alpha_j\sum_{i\in A} x_i + b_j\right)$, for some $\alpha_j$ and $b_j$. Indeed, observe that the correlation between every bit outside the parity and the label is zero, and so the gradient with respect to this bit becomes very small. However, for the bits in the parity, the correlation is large, and so the gradient is large as well.

Convergence of online gradient-descent. Notice that the parity can be implemented by a linear combination of the features $\phi_1(x), \ldots, \phi_{q'}(x)$, when $\alpha_1, \ldots, \alpha_{q'}$ are distributed uniformly. Hence, from the previous argument, after the first gradient step there exists some choice of weights for the second layer that implements the parity (and hence, separates the distribution). Now, we show that for a sufficiently large network and a sufficiently small learning rate, the weights of the first layer stay close to their value after the first iteration. Thus, a standard analysis of online gradient-descent shows that gradient-descent (on both layers) reaches a good solution.

In the rest of this section, we give a detailed proof, following the above sketch. For lack of space, some of the proofs of the technical lemmas appear in the appendix.

5.1 First Gradient Step

We want to show that for some good neurons, the weights $w^{(1)}_i$ are close enough to $\alpha_i\sum_{j\in A} e_j$, for some constant $\alpha_i$ depending on $u^{(0)}_i$. We start by showing that the irrelevant coordinates ($j \notin A$) and the bias have a very small gradient. To do this, we first analyze the gradient with respect to the uniform part of the distribution $D^{(1)}_A$, and show that it is negligible, with high probability over the initialization of a neuron:

Lemma 5. Fix $j \in [n]\setminus A$ and $b \in \mathbb{R}$ and let $D$ be the uniform distribution. Let $f(x) = \operatorname{sign}(\prod_{i\in A} x_i)$ be a parity. Then, for every $c > 0$, we have with probability at least $1 - \frac{1}{c}$ over the choice of $w$ that $\left|\mathbb{E}_x\left[x_j f(x)\,\sigma'(w^\top x + b)\right]\right|$ is negligible (the exact bound is given in the appendix). A similar result holds for $\left|\mathbb{E}_x\left[f(x)\,\sigma'(w^\top x + b)\right]\right|$.

Using a union bound on the previous lemma, we get that the above result holds for all irrelevant coordinates (and the bias), with constant probability:

Lemma 6. Let $k, n$ be odd numbers. Fix $b \in \mathbb{R}$ and let $D$ be the uniform distribution. Let $f(x) = \operatorname{sign}(\prod_{i\in A} x_i)$ be a parity. Then, for every $C > 0$, with probability at least $1 - \frac{1}{C}$ over the choice of $w$, the bounds of Lemma 5 (with $c = C(n-1)$) hold simultaneously for $\left|\mathbb{E}_x\left[x_j f(x)\,\sigma'(w^\top x + b)\right]\right|$ for all $j \notin A$, and for $\left|\mathbb{E}_x\left[f(x)\,\sigma'(w^\top x + b)\right]\right|$.

Proof of Lemma 6. Choose $c = C(n-1)$ and use a union bound on the result of Lemma 5 over all choices of $j \notin A$ and the bias.
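Before turning to the non-uniform part of the distribution, the contrast this subsection establishes (tiny gradient correlations on coordinates outside $A$, large ones on coordinates inside $A$) can be checked numerically. The following self-contained NumPy sketch estimates $\mathbb{E}_{(x,y)\sim D_A}\left[y\,x_j\,\sigma'(\langle w, x\rangle + b)\right]$ for a single randomly initialized neuron. The specific sizes ($n = 200$, $k = 5$, $2\cdot 10^5$ samples) are arbitrary illustrative choices, and the observed gap (typically one to two orders of magnitude) is a Monte-Carlo estimate, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 200, 5, 200_000
A = list(range(k))                          # w.l.o.g. take the parity on the first k bits

# Sample m examples from D_A: w.p. 1/2 uniform on {+-1}^n (D^(1)_A), w.p. 1/2
# uniform outside A with all bits in A sharing one random sign (D^(2)_A).
X = rng.choice([-1.0, 1.0], size=(m, n))
rows = np.where(rng.random(m) < 0.5)[0]
X[np.ix_(rows, A)] = rng.choice([-1.0, 1.0], size=(len(rows), 1))
y = np.prod(X[:, A], axis=1)                # y = f_A(x)

# One randomly initialized neuron, as in the symmetric initialization of Section 2.
w = rng.choice([-1.0, 0.0, 1.0], size=n)
b = 1.0 / (8 * k)

# Correlations driving the first gradient step: E[y * x_j * sigma'(<w, x> + b)],
# where sigma'(z) = 1{0 < z < 6} is the derivative of ReLU6.
pre = X @ w + b
gate = ((pre > 0) & (pre < 6)).astype(float)
corr = (y[:, None] * X * gate[:, None]).mean(axis=0)

print("max |corr| over j in A:    ", np.abs(corr[:k]).max())
print("max |corr| over j not in A:", np.abs(corr[k:]).max())
```

The off-parity estimates mix the genuinely small population quantities bounded in Lemmas 5 and 6 with Monte-Carlo noise of order $1/\sqrt m$, so the printed gap should be read qualitatively.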
Now, we show that for neurons with $\sum_{j\in A} w_j = 0$, the gradient of the irrelevant coordinates and the bias is zero on the distribution $D^{(2)}_A$ (the non-uniform part of the distribution $D_A$):

Lemma 7. Let $k, n$ be odd numbers, and let $f(x) = \operatorname{sign}(\prod_{i\in A} x_i)$. Fix $w \in \{-1, 0, 1\}^n$ with $\sum_{i\in A} w_i = 0$ and $b \in \mathbb{R}$. Then, on the distribution $D^{(2)}_A$, we have:
$$\mathbb{E}_x\left[x_j f(x)\,\sigma'(w^\top x + b)\right] = 0 \text{ for all } j \notin A, \quad \text{and} \quad \mathbb{E}_x\left[f(x)\,\sigma'(w^\top x + b)\right] = 0.$$

Combining the above lemmas implies that for some good neurons, the gradient on the distribution $D_A$ is negligible for the irrelevant coordinates and the bias. Now, it is left to show that for the coordinates of the parity ($j \in A$), the gradient on the distribution $D_A$ is large. To do this, we show that the gradient on the relevant coordinates is almost independent of the gradient of the activation function. Since the gradient with respect to the hinge-loss at the initialization is simply the correlation, this is sufficient to show that the gradient of the relevant coordinates is large.

Lemma 8. Observe the distribution $D_A$. Let $h : X \to \{\pm 1\}$ be some function supported on $A$. Let $w \in \{-1, 0, 1\}^n$ be some vector, and $b \in \mathbb{R}$. Denote $J = \{j \in [n]\setminus A : w_j \ne 0\}$ and denote $\varphi(w, b) = \Pr\left[0 < \sum_{j\in J} w_j x_j + b < 6\right]$. Then there exists a universal constant $C$ s.t.:
$$\left|\mathbb{E}\left[h(x)\,\sigma'(w^\top x + b)\right] - \mathbb{E}\left[h(x)\right]\varphi(w, b)\right| \le \frac{Ck}{\sqrt n}.$$

From all the above, we get that with non-negligible probability, the weights of a given neuron are approximately $\alpha_i\sum_{j\in A} e_j$, for some choice of $\alpha_i$ depending on $u^{(0)}_i$:

Lemma 9. Assume $n$ is large enough and $7 \le k \le \frac{1}{2}\sqrt[4]{n}$. Fix some $i \in [q]$. Then, with probability at least $\Omega\!\left(\frac{1}{k}\right)$ over the choice of $w^{(0)}_i$, we have that:
$$\max_{j\in A}\left|w^{(1)}_{i,j} - \alpha_i\sqrt n\, u^{(0)}_i\right| \le C_1, \quad \max_{j\notin A}\left|w^{(1)}_{i,j} - w^{(0)}_{i,j}\right| \le C_2, \quad \left|b^{(1)}_i - b^{(0)}_i\right| \le \frac{C_3}{\sqrt n},$$
for some universal constants $C_1, C_2, C_3$, and some $\alpha_i$ depending on $w^{(0)}_i$.

Finally, we show that the features implemented by the good neurons can express the parity function, using a linear separator with low norm. In the next two lemmas we show explicitly what are the features that the good neurons approximate:

Lemma 10. Let $b = \frac{1}{8k}$, $\alpha \in [\frac{1}{4}, 1]$ and $r \in \{-k, -k+2, \ldots, k-2, k\}$. Denote $\phi_r(z) = \sigma\!\left(\alpha(\operatorname{sign}(r)\,z + |r|)\right)$, and let $u \sim U\!\left(\left[-\frac{n}{k}, \frac{n}{k}\right]\right)$. Then, for every $\epsilon > 0$ (in the regime of parameters considered below), with probability at least $\frac{1}{8k^2}$ over the choice of $u$ we have $\left|\sigma(\sqrt n\,u\,z + b) - \phi_r(z)\right| \le \epsilon$, for every $z \in [-k, k]$.

Lemma 11. Fix $\epsilon > 0$ and assume that $n$ is large enough and $7 \le k \le c\,\sqrt[4]{\epsilon}\,\sqrt[8]{n}$, for some universal constant $c$. Fix $r \in \{-k, -k+2, \ldots, k-2, k\}$ and define $\psi_r(x) = \sigma\!\left(\alpha\!\left(\operatorname{sign}(r)\sum_{j\in A} x_j + |r|\right)\right)$. Fix some $i \in [q]$. Then, with probability at least $\frac{1}{112k^{2.5}}$ over the choice of $w^{(0)}_i$, there is a rescaling $\hat b_i$ (depending on $u^{(0)}_i$ and $b^{(0)}_i$) such that the neuron obtained after the first gradient step satisfies $\left|\hat b_i\,\sigma\!\left(\langle w^{(1)}_i, x\rangle + b^{(1)}_i\right) - \psi_r(x)\right| \le \frac{\epsilon}{2}$ for all $x \in X$.

Using the above, we show that there exists a choice of weights for the second layer that implements the parity, with high probability over the initialization:

Lemma 12. Assume that $n$ is large enough and $7 \le k \le c\,\sqrt[10]{n}$, for some universal constant $c$. Fix some $\delta > 0$, and assume that the number of neurons satisfies $q \ge Ck^7\log\frac{k+1}{\delta}$. Then, with probability at least $1 - \delta$ over the choice of the weights, there exists $u^\star \in \mathbb{R}^{2q}$ such that $g^\star(x) = \sum_{i=1}^{2q} u^\star_i\,\sigma\!\left(\langle w^{(1)}_i, x\rangle + b^{(1)}_i\right)$ satisfies $g^\star(x) f_A(x) \ge 1$ for all $x \in X$. Furthermore, we have $\|u^\star\|_2 \le \frac{B k^5}{\sqrt q}$, $\|u^\star\|_0 = \frac{B' q}{k^{2.5}}$, and for every $i \in [2q]$ with $u^\star_i \ne 0$ we have $|u^\star_i| \le 1$, for some universal constants $B, B'$.

This concludes the analysis of the first gradient step.

5.2 Convergence of Gradient-Descent

Our main result in this part relies on the standard analysis of online gradient-descent. Specifically, this analysis shows that performing gradient-descent on a sequence of convex functions reaches a set of parameters that competes with the optimum (in hindsight).
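As a concrete reference point for the guarantee stated next (Theorem 13), here is a minimal NumPy sketch of online gradient-descent on a sequence of convex losses. The quadratic losses and the helper name `online_gradient_descent` are illustrative choices, not part of the paper.

```python
import numpy as np

def online_gradient_descent(grads, theta1, eta):
    """Run theta_{t+1} = theta_t - eta * grad f_t(theta_t) and return all iterates.

    grads is a list of callables; grads[t](theta) returns the (sub)gradient of the
    convex loss f_t at theta. eta is a fixed step size."""
    thetas = [np.asarray(theta1, dtype=float)]
    for grad in grads:
        thetas.append(thetas[-1] - eta * grad(thetas[-1]))
    return thetas

# Illustration: f_t(theta) = 0.5 * ||theta - c_t||^2 for a random sequence c_t.
rng = np.random.default_rng(0)
d, T, eta = 5, 200, 0.1
centers = rng.normal(size=(T, d))
grads = [(lambda theta, c=c: theta - c) for c in centers]

thetas = online_gradient_descent(grads, np.zeros(d), eta)
online = np.mean([0.5 * np.sum((thetas[t] - centers[t]) ** 2) for t in range(T)])
best_fixed = centers.mean(axis=0)      # comparator theta* chosen in hindsight
hindsight = np.mean([0.5 * np.sum((best_fixed - c) ** 2) for c in centers])
print(f"average online loss: {online:.3f}  vs  best fixed in hindsight: {hindsight:.3f}")
```

The averaged online loss exceeds the hindsight-optimal loss by a term that Theorem 13 below bounds in terms of $\|\theta^\star\|$, $\|\theta_1\|$, the step size and the gradient norms.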
We give this result in general, when we optimize the functions $f_1, \ldots, f_T$ with respect to the parameter $\theta$:

Theorem 13. (Online Gradient-Descent, see, e.g., [23]) Fix some $\eta$, and let $f_1, \ldots, f_T$ be some sequence of convex functions. Fix some $\theta_1$, and assume we update $\theta_{t+1} = \theta_t - \eta\nabla f_t(\theta_t)$. Then for every $\theta^\star$ the following holds:
$$\frac{1}{T}\sum_{t=1}^T f_t(\theta_t) \le \frac{1}{T}\sum_{t=1}^T f_t(\theta^\star) + \frac{\|\theta^\star\|^2 + \|\theta_1\|^2}{\eta T} + \frac{\eta}{2T}\sum_{t=1}^T\|\nabla f_t(\theta_t)\|^2$$

Note that in the previous part we showed that the value of the weights of the first layer is good with high probability. In other words, optimizing only the second layer after the first gradient step is sufficient to achieve a good solution. However, since we optimize both layers with gradient-descent, we need to show that the weights of the first layer stay close to their value after the first gradient step. We start by bounding the weights of the second layer after the first iteration:

Lemma 14. Assume $\eta_1 = 1$ and $\lambda_1 = \frac{1}{2}$. Then for every $i \in [q]$ we have $\left|u^{(1)}_i\right| \le \frac{k}{\sqrt n}$.

Using this, we can bound how much the first layer changes at every gradient step:

Lemma 15. Assume that $\eta_1 = 1$, $\lambda_1 = \frac{1}{2}$ and $\eta_t = \eta$, $\lambda_t = \lambda$ for every $t > 1$, for some fixed values $\eta, \lambda \in [0, \frac{1}{2}]$. For every $t$ and every $i \in [2q]$ we have:
$$\left|u^{(t)}_i\right| \le 6\eta t + \frac{k}{\sqrt n}, \quad \left\|w^{(t)}_i - w^{(1)}_i\right\|_\infty \le 6\eta^2 t^2 + \eta t\frac{k}{\sqrt n} + 2t\lambda, \quad \left|b^{(t)}_i - b^{(1)}_i\right| \le 6\eta^2 t^2 + \eta t\frac{k}{\sqrt n}.$$

Using the above we bound the difference in the loss between optimizing the first layer and keeping it fixed, for every choice of $u^\star$ for the second layer:

Lemma 16. Fix some vector $u^\star \in \mathbb{R}^{2q}$, and let $g^{(t)}_{u^\star}(x) = \sum_{i=1}^{2q} u^\star_i\,\sigma\!\left(\langle w^{(t)}_i, x\rangle + b^{(t)}_i\right)$. Then:
$$\left|L_D(g^{(t)}_{u^\star}) - L_D(g^{(1)}_{u^\star})\right| \le 2\,\|u^\star\|\left(6\eta^2 t^2 + \eta t\frac{k}{\sqrt n} + \frac{t\lambda}{n}\right)$$

Finally, using all the above we can prove our main theorem:

Proof of Theorem 3. Let $u^\star \in \mathbb{R}^{2q}$ be the separator from Lemma 12, so $\|u^\star\|_2 \le \frac{Bk^5}{\sqrt q}$ and $\|u^\star\|_0 = \frac{B'q}{k^{2.5}}$. Denote $\tilde L_D(g^{(t)}) = \mathbb{E}\left[\ell(g^{(t)}(x), y)\right] + \lambda_t\|u^{(t)}\|^2$, and notice that the gradient of $\tilde L_D$ with respect to $u$ is the same as the gradient of the original objective. From Lemma 14, we have $\|u^{(1)}\| \le \frac{\sqrt{2q}\,k}{\sqrt n}$. Since $\tilde L_D$ is convex with respect to $u$, from Theorem 13 we have:
$$\frac{1}{T}\sum_t \tilde L_D(g^{(t)}) \le \frac{1}{T}\sum_t \tilde L_D(g^{(t)}_{u^\star}) + \frac{\|u^\star\|^2 + \|u^{(1)}\|^2}{\eta T} + \frac{\eta}{2T}\sum_t\left\|\nabla_u \tilde L_D(g^{(t)})\right\|^2 \le \frac{1}{T}\sum_t \tilde L_D(g^{(t)}_{u^\star}) + \frac{B^2 k^{10}}{\eta T q} + \frac{2qk^2}{\eta T n} + 36\eta q.$$
Using Lemma 16 we get that for every $t$ we have:
$$\left|\tilde L_D(g^{(t)}_{u^\star}) - \tilde L_D(g^{(1)}_{u^\star})\right| \le 2\,\|u^\star\|\left(6\eta^2 t^2 + \eta t\frac{k}{\sqrt n} + \frac{t\lambda}{n}\right)$$
Therefore we get:
$$\frac{1}{T}\sum_t L_D(g^{(t)}) \le \tilde L_D(g^{(1)}_{u^\star}) + 2\,\|u^\star\|\left(6\eta^2 T^2 + \eta T\frac{k}{\sqrt n} + \frac{T\lambda}{n}\right) + \frac{B^2 k^{10}}{\eta T q} + \frac{2qk^2}{\eta T n} + 36\eta q$$
Now, take $\eta = \frac{k^2}{T\sqrt q}$. Since $u^\star$ separates the distribution $D_A$ with margin $1$ when taking the weights after the first iteration, we have $\tilde L_D(g^{(1)}_{u^\star}) \le \frac{1}{2}\|u^\star\|^2 \le \frac{B^2 k^{10}}{2q}$. Therefore:
$$\frac{1}{T}\sum_t L_D(g^{(t)}) \le \frac{B^2 k^{10}}{2q} + B'\left(\frac{6k^8}{\sqrt q} + \frac{2qk}{\sqrt n}\right) + \frac{36k^2\sqrt q}{T}$$
From this, there exists some $2 \le t \le T + 1$ such that:
$$L_D(g^{(t)}) \le \frac{B^2 k^{10}}{2q} + B'\left(\frac{6k^8}{\sqrt q} + \frac{2qk}{\sqrt n}\right) + \frac{36k^2\sqrt q}{T}$$
And since the hinge-loss upper bounds the zero-one loss, we get the required.

Figure 1: MNIST-parity experiment. Top left: test performance on parity of a single MNIST image. Top right: test performance on parity of three MNIST images. Bottom: examples for the MNIST-parity experiment; the model has to predict the parity of the digits' sum. Compared models: ReLU network, NTK regime, Gaussian features, ReLU features.

6 Experiment

In Section 3 we showed a family of distributions $F$ that separates linear classes from neural networks. To validate that our theoretical results apply to a more realistic setting, we perform an experiment that imitates the parity problem using the MNIST dataset. We observe the following simple task: given a strip with $k$ digits chosen uniformly at random from the MNIST dataset, determine whether the sum of the digits is even or odd. We compare the performance of a ReLU network with one hidden layer against various linear models.
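A minimal PyTorch sketch of this setup is given below. The strip construction (horizontal concatenation of $k$ digits), the cross-entropy objective, the number of samples, and the helper names (`make_parity_strips`, `TwoLayerNet`) are illustrative assumptions; only the 512 hidden units, the AdaDelta optimizer, the 20 epochs and the batch size of 128 follow the details reported in Section 6.1.

```python
import torch
import torch.nn as nn
from torchvision import datasets

def make_parity_strips(mnist, k, num_samples):
    """Concatenate k random MNIST digits into one strip; label = parity of their sum."""
    idx = torch.randint(len(mnist), (num_samples, k))
    images = mnist.data[idx].float() / 255.0              # (num_samples, k, 28, 28)
    strips = images.permute(0, 2, 1, 3).reshape(num_samples, 28, 28 * k)
    labels = (mnist.targets[idx].sum(dim=1) % 2).long()   # 0 = even, 1 = odd
    return strips.reshape(num_samples, -1), labels

class TwoLayerNet(nn.Module):
    """One-hidden-layer ReLU network; with train_first_layer=False it becomes a
    ReLU random-features baseline that trains only the output layer."""
    def __init__(self, in_dim, hidden=512, train_first_layer=True):
        super().__init__()
        self.first = nn.Linear(in_dim, hidden)
        self.first.weight.requires_grad_(train_first_layer)
        self.first.bias.requires_grad_(train_first_layer)
        self.out = nn.Linear(hidden, 2)

    def forward(self, x):
        return self.out(torch.relu(self.first(x)))

k = 3
mnist = datasets.MNIST("./data", train=True, download=True)
X, y = make_parity_strips(mnist, k, num_samples=20_000)
model = TwoLayerNet(in_dim=28 * 28 * k)
opt = torch.optim.Adadelta([p for p in model.parameters() if p.requires_grad])
loss_fn = nn.CrossEntropyLoss()
for epoch in range(20):
    perm = torch.randperm(len(X))
    for i in range(0, len(X), 128):
        batch = perm[i:i + 128]
        opt.zero_grad()
        loss_fn(model(X[batch]), y[batch]).backward()
        opt.step()
```

The NTK-regime and Gaussian random-features baselines from Figure 1 additionally require, respectively, the decoupled-gating construction described in Section 6.1 and a Gaussian first layer; they are omitted here for brevity.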
In the case where $k = 1$, the MNIST-parity task is just a simplified version of the standard MNIST classification task, where instead of 10 classes there are only 2 classes of numbers. In this case, we observe that both the neural-network model and the linear models obtain similar performance, with only a slight advantage to the neural-network model. However, when $k = 3$, the task becomes much harder: it is not enough to merely memorize the digits and assign them to classes, as the model needs to compute the parity of their sum. In this case, we observe a dramatic gap between the performance of the ReLU network and the performance of the linear models. While the ReLU network achieves almost 80% accuracy, the linear models barely perform better than chance. The results of the experiment are shown in Figure 1.

6.1 Experiment Details

In the MNIST-parity experiment we train a neural-network model, as well as various linear models, to predict the parity of the sum of the digits in the strip. Our neural-network architecture is a one-hidden-layer network with ReLU activation and 512 neurons in the hidden layer. We compare this network to a network of a similar architecture, except that we force the network to stay in the regime of the linear approximation induced by the network's gradient, i.e., the neural-tangent-kernel (NTK) regime [15]. To do this, we use an architecture that decouples the gating from the linearity of the ReLU function, and keeps the gates fixed throughout the training process (as suggested in [12]). Namely, we use the fact that $\mathrm{ReLU}(\langle w, x\rangle) = \langle w, x\rangle\cdot\mathbb{1}\{\langle w, x\rangle > 0\}$, and by decoupling the first and second term during the optimization, we force the network to stay in the NTK regime. We compare these architectures to standard random-features models, where we randomly initialize the first layer, but train only the second layer. Such models are known to be an efficient method for approximating kernel-SVM (see [22]). We use both ReLU random features (standard random initialization with ReLU activation), and Gaussian random features, which approximate the RBF kernel. Both models have 512 features in the first layer. All models are trained with the AdaDelta optimizer, for 20 epochs, with batch size 128.

7 Discussion and Future Work

In this work we showed an exponential separation between learning neural networks with gradient-descent and learning linear models, i.e., learning linear separators over a fixed representation of the data. This shows that learning neural networks is a strictly stronger learning model than any linear model, including linear classifiers, kernel methods and random features. In other words, neural networks are not just glorified kernel methods, as might be implied from previous works in the field. This demonstrates that our current understanding of neural-network learning is very limited, as only a few works so far have given positive results beyond the linear case.

There are various open questions which we leave for future work. The first immediate research direction is to find other distribution families that are learnable with neural networks via gradient-descent, but not using linear models. Another interesting question is finding distribution families with a separation between deep and shallow networks. Specifically, finding a family of distributions that are learnable with gradient-descent using depth-three networks, but cannot be learned using depth-two networks.
Finally, we believe that understanding the behavior of neural networks trained on specific non-linear distribution families will allow us to infer which properties of a distribution make it learnable using neural networks. Characterizing such distributional properties is another promising direction for future research.

Broader Impact

As the primary focus of this paper is on theoretical results and theoretical analysis, a Broader Impact discussion is not applicable.

Acknowledgments and Disclosure of Funding

This research is partially supported by ISF grant 2258/19.

References

[1] Zeyuan Allen-Zhu and Yuanzhi Li. What can ResNet learn efficiently, going beyond kernels? arXiv preprint arXiv:1905.10337, 2019.
[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
[3] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
[4] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning polynomials with neural networks. In Proceedings of the 31st International Conference on Machine Learning, pages 1908–1916, 2014.
[5] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
[6] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
[7] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.
[8] Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
[9] Amit Daniely. Neural networks learning and memorization with (almost) no overparameterization. arXiv preprint arXiv:1911.09873, 2019.
[10] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In NIPS, 2016.
[11] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
[12] Jonathan Fiat, Eran Malach, and Shai Shalev-Shwartz. Decoupling gating from linearity. arXiv preprint arXiv:1906.05032, 2019.
[13] Rong Ge, Runzhe Wang, and Haoyu Zhao. Mildly overparametrized neural nets can memorize training data efficiently. arXiv preprint arXiv:1909.11837, 2019.
[14] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Limitations of lazy training of two-layers neural network. In Advances in Neural Information Processing Systems, pages 9108–9118, 2019.
[15] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.
[16] Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. arXiv preprint arXiv:1909.12292, 2019.
[17] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
[18] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
[19] Chao Ma, Lei Wu, et al. A comparative analysis of the optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. arXiv preprint arXiv:1904.04326, 2019.
[20] Samet Oymak and Mahdi Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? arXiv preprint arXiv:1812.10004, 2018.
[21] Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: Global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674, 2019.
[22] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
[23] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
[24] Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3067–3075. JMLR.org, 2017.
[25] Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix Chernoff bound. arXiv preprint arXiv:1906.03593, 2019.
[26] Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. arXiv preprint arXiv:1611.03131, 2016.
[27] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. arXiv preprint arXiv:1904.00687, 2019.
[28] Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. arXiv preprint arXiv:1906.04688, 2019.