Learning Polynomials with Neural Networks

Alexandr Andoni (ANDONI@MICROSOFT.COM), Microsoft Research
Rina Panigrahy (RINA@MICROSOFT.COM), Microsoft Research
Gregory Valiant (GREGORY.VALIANT@GMAIL.COM), Stanford University
Li Zhang (LZHA@MICROSOFT.COM), Microsoft Research

Abstract

We study the effectiveness of learning low-degree polynomials using neural networks by the gradient descent method. While neural networks have been shown to have great expressive power, and gradient descent has been widely used in practice for learning neural networks, few theoretical guarantees are known for such methods. In particular, it is well known that gradient descent can get stuck at local minima, even for simple classes of target functions. In this paper, we present several positive theoretical results to support the effectiveness of neural networks. We focus on two-layer neural networks where the bottom layer is a set of non-linear hidden nodes and the top layer node is a linear function, similar to Barron (1993). First, we show that for a randomly initialized neural network with sufficiently many hidden units, the generic gradient descent algorithm learns any low-degree polynomial. Secondly, we show that if we use complex-valued weights (the target function can still be real), then under suitable conditions there are no robust local minima: the neural network can always escape a local minimum by performing a random perturbation. This property does not hold for real-valued weights. Thirdly, we discuss whether sparse polynomials can be learned with small neural networks, with the size depending on the sparsity of the target function.

1. Introduction

Neural networks have drawn significant attention from the machine learning community, in part due to recent empirical successes (see the surveys of Bengio (2009; 2013)). Neural networks have led to state-of-the-art systems for crucial applications such as image recognition, speech recognition, natural language processing, and others; see, e.g., (Krizhevsky et al., 2012; Goodfellow et al., 2013; Wan et al., 2013). Several key concepts have been instrumental in the success of training neural networks, including gradient descent, parallel implementations, carefully designed network structure (e.g., convolutional networks), and unsupervised pre-training (e.g., auto-encoders, RBMs) (Hinton et al., 2006; Ranzato et al., 2006; Bengio et al., 2007), among others. With the flurry of empirical successes and anecdotal evidence suggesting (sometimes conflicting) principles in the design and training of neural networks, there seems to be a need for a more solid theoretical understanding of neural networks.

Perhaps the most natural starting point for such investigations is the question of when a neural network provably learns a given target function. A classical result is that, if the target function is linear, the perceptron algorithm will learn it. This is equivalent to saying that gradient descent on a neural network with a single node (or layer) can successfully learn a linear function. But what can we prove about gradient descent on networks with more than one layer?
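For concreteness, the following is a minimal sketch of this one-node baseline: plain gradient descent on a single linear unit under squared loss recovers a linear target. The dimension, learning rate, and sample size below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_samples, steps, lr = 10, 1000, 500, 0.05

w_true = rng.normal(size=n)             # unknown linear target f(x) = w_true . x
X = rng.normal(size=(num_samples, n))   # inputs
y = X @ w_true                          # noiseless labels

w = np.zeros(n)                         # the single linear node g(x) = w . x
for _ in range(steps):
    err = X @ w - y                     # residuals g(x) - f(x)
    grad = X.T @ err / num_samples      # gradient of the empirical squared loss (up to a constant factor)
    w -= lr * grad                      # plain gradient descent step

print("parameter error:", np.linalg.norm(w - w_true))   # should be close to zero
```

The contrast the paper draws is with networks that have a hidden layer, where the loss is no longer convex in the weights.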
In full generality, this question is nearly hopeless: standard complexity-theoretic results strongly suggest there are no efficient algorithms for learning arbitrary functions (let alone via gradient descent), even for target functions representable by very low-depth networks (circuits) (Applebaum et al., 2006). Nevertheless, we may hope to prove concrete results in restricted settings whose structure is reminiscent of the structure present in practical settings. In this work, we consider learning bounded-degree polynomials with neural networks. Polynomials are an expressive class of functions since they can be used to approximate many reasonable functions; for example, any Lipschitz function can be well-approximated by a bounded-degree polynomial (see, e.g., Andoni et al. (2014)).

We establish provable guarantees for gradient descent on two-layer neural networks (i.e., networks with one hidden layer of neurons) for learning bounded-degree polynomials. Our main result shows this setting can learn any bounded-degree polynomial, provided the neural network is sufficiently large. Specifically, suppose the target function $f$ is a degree-$d$ polynomial over an $n$-dimensional variable $x \in \mathbb{C}^n$, with $x$ distributed according to a product distribution (for our main results, the distribution can be the Gaussian or uniform distribution over the reals). The two-layer network with $m$ hidden units outputs a function $g(x) = \sum_{i=1}^m \alpha_i \phi(w_i \cdot x)$, where $\alpha_i \in \mathbb{C}$ and $w_i \in \mathbb{C}^n$ are network parameters, and $\phi$ is a non-linear activation function (e.g., a sigmoid). It has been known (Barron, 1993) that, when $m \geq n^d$, there exists some choice of parameters $w_i, \alpha_i$ so that the network $g$ closely approximates the function $f$. We show that, in fact, with high probability, even if the bottom layer (the $w_i$'s) is set to be random, there is a choice of the top layer (the $\alpha_i$'s) such that the neural network approximates the target function $f$. Hence the perceptron algorithm, when run on only the top layer while keeping the bottom layer fixed, will learn the correct weights $\alpha_i$.

Learning polynomials. Analyzing gradient descent on the entire network turns out to be more challenging since, unlike when the bottom layer is fixed, the error function is no longer convex. We show that even if gradient descent is run on the entire network, i.e., on both the top and bottom layers, the neural network will still find network parameters $\alpha_i$ and $w_i$ for which the network approximates the target function $f$. This can be interpreted as saying that learning the bottom layer does not negatively affect the overall learning of the target function. Indeed, we show that the learning of the top layer (the $\alpha_i$'s) happens sufficiently fast, in particular before the bottom layer (the $w_i$'s) changes significantly. This insight may be useful in analyzing gradient descent for neural networks with multiple layers.

We further explore the landscape of the gradient descent by showing a conceptually compelling, though incomparable, result: for sufficiently large networks, in a large region of the parameter space, there are no robust local optima. That is, for any point in the parameter space that satisfies some mild conditions on the weights, with constant probability, a random perturbation will reduce the error by a non-negligible factor. Hence even when gradient descent is stuck at a local minimum, a random perturbation can take it out.
Because of the conditions on the weights of the neural network, we cannot use this theorem to conclude that gradient descent plus random perturbation will always converge to the global optimum from any initialization; nevertheless, it provides a rigorous, and conceptually compelling, explanation for why the existence of local optima does not deter the usage of neural networks in practice. We stress that the random perturbation on the weights is complex-valued, although the input can still be real-valued. In contrast, the same robustness result does not hold when the random perturbation is strictly real; in this case there are robust local minima such that a random perturbation will increase the error. Intuitively, the complex-valued weights give extra dimensions, and hence nicer geometry, to the local structure of the space of neural networks, which allows it to escape from a local minimum. We find this result intriguing and hope it can further motivate the construction of new types of neural networks.

Learning sparse polynomials. It is natural to ask whether, in the case of a target function that is simpler (e.g., because it can be represented by a small number of hidden units), gradient descent can actually learn it more efficiently. More efficiently here can mean both: 1) with a smaller network, and 2) at a faster convergence rate. To study this general question, we suggest the following concrete open problem. Define the sparsity of a polynomial as the number of its nonzero monomial terms. We know, thanks to Barron (1993), that a $k$-sparse degree-$d$ polynomial can be represented by a neural network with $nk\,d^{O(d)}$ hidden nodes; but can gradient descent learn such a polynomial with a neural network of only $(nkd^d)^{O(1)}$ hidden nodes, and in a similar amount of time? We note that here we require the target function to be a $k$-sparse degree-$d$ polynomial, which is more restrictive than requiring the target to be representable by a small network as in Barron (1994). This might therefore be an easier problem, especially given that recent work has shown that one can indeed learn, for the Gaussian or uniform distribution, any sparse polynomial in $nk\,d^{O(d)}$ time (Andoni et al., 2014), albeit via an algorithm that does not resemble gradient descent. We should emphasize that it is important to consider well-behaved distributions such as the Gaussian or uniform distribution. Otherwise, if we allow arbitrary distributions, the sparsity assumption may not be helpful. For example, in the boolean case (where $x$ is distributed over the boolean cube), the problem is at least as hard as the learning parity with noise and learning juntas problems, which we suspect do not admit any efficient algorithms and are even used as cryptographic primitives (Alekhnovich, 2003; Regev, 2005; Peikert, 2009; Applebaum et al., 2010).

While we do not have the answer to this natural but challenging question, we give some indications why this may indeed be possible. In particular, we show that if the target function depends only on $k \ll n$ variables, then the neural network will learn a function that also depends only on these $k$ variables. Additionally, we provide some strong empirical evidence that such small networks are capable of learning sparse polynomials.

With the recent success of neural networks in practice, there is an increasing demand for insights and tools for analyzing gradient descent for learning neural networks. Our work represents one such effort. We note that Saxe et al. (2013) is another recent work in this space.
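To fix ideas before the formal preliminaries, here is a minimal sketch of the two-layer model $g(x) = \sum_i \alpha_i \phi(w_i \cdot x)$ studied in this paper, using the truncated exponential activation and the complex input and weight distributions defined in Section 2 below. The concrete sizes, the example target, and the Monte Carlo estimate of the error norm are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def sample_C(r, size):
    """Uniform complex numbers of modulus exactly r (the distribution C(r) of Section 2)."""
    return r * np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=size))

def phi_d(z, d):
    """Truncated exponential activation: sum_{k <= d} z^k / k!."""
    return sum(z ** k / math.factorial(k) for k in range(d + 1))

def network(X, W, alpha, d):
    """Two-layer net g(x) = sum_i alpha_i * phi_d(w_i . x), evaluated on a batch X."""
    return phi_d(X @ W.T, d) @ alpha

n, m, d = 8, 200, 3
W = sample_C(1.0 / np.sqrt(n), size=(m, n))   # random bottom layer, rows w_i ~ C(1/sqrt(n))^n
alpha = np.zeros(m, dtype=complex)            # top layer, the part that will be trained first

X = sample_C(1.0, size=(1000, n))             # inputs drawn from D = C(1)^n
f = X[:, 0] * X[:, 1] + X[:, 2] ** 2          # an example degree-2 target polynomial
err = np.sqrt(np.mean(np.abs(f - network(X, W, alpha, d)) ** 2))  # Monte Carlo estimate of ||f - g||_D
print("error of the untrained network:", err)
```

Sections 3 and 4 make precise the claim that, with $m$ large enough, training only $\alpha$ (and then both layers) drives this error down.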
We will only sketch proof ideas in this abstract. The full details can be found in the supplementary material.

2. Preliminaries

We will mostly use complex-valued inputs and network parameters, to simplify the description; most results generalize to other distributions. For a complex number $c \in \mathbb{C}$, write $\bar{c}$ for its conjugate, and define its norm as $|c| = \sqrt{c\bar{c}}$. For a matrix (or a vector) $P$, write $P^*$ for the adjoint of $P$, i.e., $(P^*)_{ij} = \overline{P_{ji}}$. We denote by $C(r)$ the uniform distribution over complex numbers with norm exactly $r$, by $N(\sigma^2)$ the real Gaussian distribution with mean $0$ and variance $\sigma^2$, and by $U(r)$ the uniform distribution on $[-r, r]$. For a distribution $P$, denote by $P^n$ the $n$-fold product distribution. For any given distribution $D$ and two functions $f, g : \mathbb{C}^n \to \mathbb{C}$, define the inner product $\langle f, g \rangle_D = \mathbb{E}_{x \sim D}[f(x)\overline{g(x)}]$ and the norm $\|f\|_D = \sqrt{\mathbb{E}_{x \sim D}|f(x)|^2}$.

Polynomials. For any $J = (J_1, \dots, J_n) \in \mathbb{N}^n$, write the monomial $x^J = x_1^{J_1} \cdots x_n^{J_n}$, and define $|J| = \sum_k J_k$. For a polynomial $p(x) = \sum_J b_J x^J$ with $b_J \in \mathbb{C}$, its degree is $\deg(p) = \max_{b_J \neq 0} |J|$. The degree-$d$ polynomials form a linear space of dimension roughly $n^d$. For a given distribution $D$, a polynomial basis $P_1, P_2, \ldots$ is orthonormal w.r.t. $D$ if $\langle P_i, P_j \rangle_D = 1$ when $i = j$ and $0$ otherwise. We will use the following basic but important fact that monomials form an orthonormal basis for $D = C(1)^n$.

Fact 2.1. $\langle w^J, w^{J'} \rangle_{C(r)^n} = r^{2|J|}$ if $J = J'$ and $0$ otherwise.

For the Gaussian and uniform distributions, the corresponding orthonormal bases are the Hermite and Legendre polynomials, respectively.

Neural networks. Assume $\phi : \mathbb{C} \to \mathbb{C}$ is an analytic function without poles, and write $\phi(z) = \sum_{j \geq 0} a_j z^j$. For any $w \in \mathbb{C}^n$, define $\phi_w : \mathbb{C}^n \to \mathbb{C}$ as
$$\phi_w(x) = \phi(w \cdot x) = \sum_J a_J w^J x^J, \qquad (2.1)$$
where the coefficient $a_J$ absorbs the corresponding multinomial coefficient of $a_{|J|}$. The (two-layer) neural network is defined as a function of the form $g(x) = \sum_i \alpha_i \phi_{w_i}(x)$. Here $\phi$ is called the activation function, and each $\phi_{w_i}$ is a hidden unit. We refer to the $\alpha_i$'s as the upper layer and the $w_i$'s as the lower layer. In this paper, we consider the activation function $\phi(z) = e^z$ and its truncated version $\phi_d(z) = \sum_{k=0}^d z^k/k!$, i.e., $e^z$ with the terms of degree higher than $d$ truncated (in this case, we call it a truncated neural network). While a more common $\phi$ is the sigmoid function, the particular form of $\phi$ is not that important as long as it has good coverage of all the degrees in its Taylor series (see, for example, the discussion in Barron (1993)) and, for analysis convenience, is defined everywhere on the complex plane. Our particular choice is just for the sake of a cleaner analysis.

Learning. Here, learning means constructing a representation of an unknown (target) function from some function family $\mathcal{F}$, given random samples. In this paper, we focus on learning low-degree polynomials where the input is drawn from $C(1)^n$; we will comment when the results can be extended to the other distributions. Learning the neural network via gradient descent proceeds by minimizing the error $\|f - g\|_D^2$ as a function of the weights $w_i$ and $\alpha_i$. To focus on the main ideas, we assume that there are enough samples that the empirical estimate of the gradient is sufficiently accurate, so we work in the model where we have access to the exact gradient.

3. Random Initialization of Neural Network

In this section we prove that any polynomial can be represented using a linear combination of a sufficient number of hidden units with weights $w_i$ initialized randomly.
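Before the formal development, here is a small numerical illustration of this claim. It is only a sketch: it solves for the top layer by least squares over samples rather than by the explicit construction used in the proof below, and the sizes $n$, $m$, $d$ and the target are toy choices.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, m, d, num_samples = 5, 300, 3, 2000

def sample_C(r, size):
    return r * np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=size))   # the distribution C(r)

def phi(z):                                      # truncated exponential phi_d
    return sum(z ** k / math.factorial(k) for k in range(d + 1))

W = sample_C(1.0 / np.sqrt(n), size=(m, n))      # random hidden units w_i ~ C(1/sqrt(n))^n
X = sample_C(1.0, size=(num_samples, n))         # samples x ~ D = C(1)^n
p = X[:, 0] * X[:, 1] * X[:, 2]                  # a degree-3 monomial; ||p||_D = 1

features = phi(X @ W.T)                          # column i holds phi_{w_i} evaluated on the samples
alpha, *_ = np.linalg.lstsq(features, p, rcond=None)   # best top layer for this fixed bottom layer
residual = features @ alpha - p
print("relative error :", np.linalg.norm(residual) / np.linalg.norm(p))
print("sum |alpha_i|^2:", float(np.sum(np.abs(alpha) ** 2)))
```

With enough random hidden units the residual is essentially zero and the top-layer weights stay small, which is the qualitative content of Theorem 3.1 below.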
This representation result is a strengthening of Barron (1993), who showed that there exist some weights $w_i$ for which a linear combination of the hidden units yields a given polynomial. In addition, we show that the weights at the top node have small norm. Compared to the representability result in Barron (1993), we require more units: $O(n^{2d})$, rather than $O(n^d)$ there.

Theorem 3.1 (Representation Theorem). Let $\phi(z) = e^z$ and let the distribution be $D = C(1)^n$. For $m = O(n^{2d}/\epsilon^2)$, choose $m$ random $w_1, w_2, \ldots, w_m$ from the distribution $C(1/\sqrt{n})^n$. Then, with high probability, for any polynomial $p$ of degree $d$ and norm $1$, there exist $\alpha_1, \ldots, \alpha_m$ with $\sum_i |\alpha_i|^2 = O(n^{2d}/m)$ such that $\|\sum_i \alpha_i \phi_{w_i} - p\|_D \leq \epsilon$. The result also holds when $D$ is the Gaussian distribution $N(1)^n$ or the uniform distribution $U(1)^n$.

To prove the theorem, we consider general $\phi$ and $D$ first and then instantiate them by estimating the relevant parameters. The following is immediate from (2.1) and Fact 2.1.

Observation 3.2. For any $x \in \mathbb{C}^n$ and any $J \in \mathbb{N}^n$, $\mathbb{E}_{w \sim C(r)^n}\left[\overline{w^J}\,\phi_w(x)\right] = a_J\, r^{2|J|}\, x^J$.

In the following we show, constructively, how to pick the top-layer weights to approximate any given $p$ for the given set of units. For any polynomial $p(x) = \sum_J b_J x^J$, any $r \geq 0$, and any weight vector $w$, define $T_{p,r}(w) = \sum_J c_J(r)\, b_J\, \overline{w^J}$, where $c_J(r) \equiv 1/(a_J r^{2|J|})$. By Observation 3.2, for any $x$,
$$\mathbb{E}_{w \sim C(r)^n}[T_{p,r}(w)\,\phi_w(x)] = p(x). \qquad (3.1)$$
Hence $T_{p,r}(w)\,\phi_w$ is centered around $p$ in the function space. We can apply a standard concentration bound to show that by averaging $m$ such terms, with the $w$'s chosen independently and $m$ large enough, one can approximate $p(x)$ arbitrarily closely. More precisely, suppose that $w_1, \ldots, w_m$ are sampled from $C(r)^n$, and let $\eta(x)$ denote the error $\frac{1}{m}\sum_{i=1}^m T_{p,r}(w_i)\,\phi_{w_i}(x) - p(x)$. We obtain
$$\mathbb{E}_{w_1,\ldots,w_m}\|\eta\|_D^2 \leq \frac{1}{m}\,\mathbb{E}_{w \sim C(r)^n}\left[|T_{p,r}(w)|^2\,\|\phi_w\|_D^2\right]. \qquad (3.2)$$
Let $a(d) = \min_{|J| \leq d} |a_J|$, and define $\|p\|_1 = \sum_J |b_J|$. When $w$ is drawn from $C(r)^n$ with $r \leq 1$, we have
$$|T_{p,r}(w)| \leq \sum_J \left|c_J(r)\, b_J\, \overline{w^J}\right| = \sum_J \left|\frac{b_J w^J}{a_J r^{2|J|}}\right| = \sum_J \left|\frac{b_J}{a_J r^{|J|}}\right| \leq \frac{1}{a(d) r^d} \sum_J |b_J| = \frac{\|p\|_1}{a(d) r^d}, \qquad (3.3)$$
using $\deg(p) = d$ and $|w^J| = r^{|J|}$. Denote $\beta_D(d, r) = \mathbb{E}_{w \sim C(r)^n}\|\phi_w\|_D^2 / r^{2d}$ and $\gamma_D(d) = \max_{\|p\|_D = 1} \|p\|_1$. Plugging (3.3) into (3.2), we have:

Lemma 3.3. Whenever $m \geq \gamma_D(d)^2\, \beta_D(d, r) / (\epsilon\, a(d))^2$, with high probability over $w_1, \ldots, w_m$ we have $\|\eta\|_D \leq \epsilon$ for any $p$ with $\deg(p) \leq d$ and $\|p\|_D = 1$.

Note that the number of required hidden units is determined by various parameters that depend on $\phi$ and $D$. Now consider $\phi(z) = e^z$; then $a(d) \geq 1/d!$. Suppose that $D = C(1)^n$. Then $\beta_D(d, r) = O(e^{2\sqrt{n} r}/r^{2d})$, so taking $r = O(1/\sqrt{n})$ we have $\beta_D(d, 1/\sqrt{n}) = O(n^d)$. Further, $\|p\|_D^2 = \sum_J |b_J|^2$, so $\gamma_D(d) = O(\sqrt{n^d})$. Plugging these parameters into Lemma 3.3, we get $m = O(n^{2d}/\epsilon^2)$. In addition, $\alpha_i = T_{p,r}(w_i)/m$, so
$$\sum_{i=1}^m |\alpha_i|^2 = \frac{1}{m^2} \sum_{i=1}^m |T_{p,r}(w_i)|^2 \leq \frac{\|p\|_1^2}{m\, a(d)^2\, r^{2d}} = O(n^{2d}/m).$$
The same bounds can be derived for distributions such as the standard Gaussian distribution in $\mathbb{R}^n$ and the uniform distribution $U(1)^n$. This proves Theorem 3.1.

4. Gradient Descent for Polynomials

In this section we show that gradient descent on a two-layer neural network can learn a polynomial function, given enough hidden units and a small enough learning rate. The statement relies on the fact that the weights have been initialized randomly. The formal statement is in Theorem 4.2. First we prove a warm-up (simpler) statement: if we run gradient descent only on the upper layer (and leave the weights in the lower layer intact), then gradient descent will converge to a network approximating the polynomial up to a small error. We use the representation theorem from the previous section.
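Before turning to the warm-up analysis, here is a quick Monte Carlo check of the unbiasedness identity (3.1) on which the representation theorem rests. It is a sketch only: the monomial target, dimensions, and sample count are assumed toy parameters.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n, d, num_w = 4, 3, 200000
r = 1.0 / np.sqrt(n)

def sample_C(radius, size):
    return radius * np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=size))

def phi(z):                                   # truncated exponential, degree d
    return sum(z ** k / math.factorial(k) for k in range(d + 1))

# For the monomial p(x) = x_1 x_2 we have b_J = 1 and a_J = 1, so
# T_{p,r}(w) = conj(w_1 w_2) / r^4, and (3.1) says E_w[T_{p,r}(w) phi_w(x)] = x_1 x_2.
x = sample_C(1.0, size=n)                     # a test point x ~ C(1)^n
W = sample_C(r, size=(num_w, n))              # many independent w ~ C(r)^n

T = np.conj(W[:, 0] * W[:, 1]) / r ** 4
estimate = np.mean(T * phi(W @ x))
print("Monte Carlo estimate:", estimate)
print("p(x) = x_1 x_2     :", x[0] * x[1])
```

The two printed values agree up to Monte Carlo error, illustrating why averaging $m$ such terms, for $m$ large, approximates $p$.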
In this simpler, warm-up statement we also assume that we have access to the exact value of the gradient. We then show a more general statement, which reaches the same conclusion even if we run the generic gradient descent (i.e., on the entire network), assuming the number of hidden units is sufficiently large. The main intuition behind the result is that the top-layer weights converge to a good state (as in the simplified theorem) before the modifications in the lower layer change enough to affect the overall representability. In particular, one step of gradient descent updates the weights $\alpha_i$ of all hidden units at once. Hence the total speed of convergence of the weights $\alpha_i$ is essentially proportional to the number of hidden units, whereas the speed of change of each $w_i$ is not. Once we have enough hidden units, namely $n^{O(d)}$, the speed of convergence of the $\alpha_i$ is sufficiently high to overtake the changes to the weight $w_i$ of any particular hidden unit.

Theorem 4.1. Fix some degree-$d$ polynomial $f$ of norm $1$ and a desired error $\epsilon > 0$. Consider a two-layer neural network with $m = \Omega(n^{2d}/\epsilon^2)$ hidden units. We initialize the $\alpha_i$'s such that $\|\alpha\| \leq 1$ (e.g., $\alpha = 0$ would do), and choose the weights $w_i$ randomly from $C(1/\sqrt{n})^n$. Consider the algorithm where we run gradient descent on the weights in the upper layer (keeping the weights in the lower layer fixed), with access to the exact gradient. Then, for a learning rate $\lambda < 1/m$, the algorithm converges to a network $g$ such that $\|g - f\|_D \leq \epsilon$ in $O\!\left(\frac{n^{2d}}{\lambda \epsilon^2 m}\right)$ steps.

Proof. As in the preliminaries, $g$ denotes the function produced by the neural network: $g(x) = \sum_{i=1}^m \alpha_i \phi(w_i \cdot x)$. Abusing notation, we will think of all functions in the basis of monomials $x^J$. In particular, in the function space $g$ is a sum of $m$ vectors $p_1, \ldots, p_m$, where $p_i = \phi_{w_i}$ depends on the vector $w_i$: $g = \sum_{i=1}^m \alpha_i p_i$. The gradient descent minimizes the quantity $E = \langle e, e \rangle = e^* e$ for the error function $e = f - g$. We would like to take the gradient with respect to $\alpha_i$, but $E$ is not analytic with respect to $e$. Instead we rely on Wirtinger calculus and consider the gradient $\frac{\partial E}{\partial \bar{\alpha}_i} = -p_i^* e = -\langle e, p_i \rangle$ (we essentially treat the variables and their conjugates as independent variables, and take the derivative with respect to $\bar{\alpha}_i$). In one step of the gradient descent the new function $g'$ is $\sum_i p_i\, (\alpha_i + \lambda \langle e, p_i \rangle)$, and the new error function is
$$e' = f - g' = e - \lambda \sum_i p_i \langle e, p_i \rangle = (I - \lambda P P^*)\, e,$$
where $I$ is the identity matrix and $P$ is the matrix whose columns are the $p_i$'s, so that $\sum_i p_i p_i^* = P P^*$. After $l$ iterations of the gradient descent, the error is $e^{(l)} = (I - \lambda P P^*)^l e^{(0)}$, where $e^{(0)}$ is the starting error function. We apply the representation Theorem 3.1 to $e^{(0)}$ to obtain $e^{(0)} = x + r$, where $\|r\| \leq \epsilon$ and $x$ can be expressed as $x = \sum_i a_i p_i = Pa$, with $\|a\|^2 \leq O(n^{2d}/m)$ for $a \equiv (a_1, \ldots, a_m)^t$. Then we can rewrite the above as
$$e^{(l)} = (I - \lambda P P^*)^l (r + Pa) = (I - \lambda P P^*)^l r + (I - \lambda P P^*)^l Pa.$$
Let us see what happens to the second term after one iteration: $(I - \lambda P P^*) Pa = Pa - \lambda P P^* P a = P (I - \lambda P^* P) a$. Hence $(I - \lambda P P^*)^l Pa = P (I - \lambda P^* P)^l a$. Let $a^{(l)} \equiv (I - \lambda P^* P)^l a$, so that the second term equals $P a^{(l)}$. Suppose $\|P a^{(l)}\| \geq \epsilon$. Then
$$\|a^{(l+1)}\|^2 = (a^{(l+1)})^* a^{(l+1)} = \left((I - \lambda P^* P) a^{(l)}\right)^* (I - \lambda P^* P) a^{(l)} = \|a^{(l)}\|^2 - 2\lambda \|P a^{(l)}\|^2 + \lambda^2 \|P^* P a^{(l)}\|^2. \qquad (4.1)$$
Since $\|P\| \leq \sqrt{m}$ and $\lambda \|P\|^2 \leq 1$, we conclude that $\|a^{(l+1)}\|^2 \leq \|a^{(l)}\|^2 - \lambda \|P a^{(l)}\|^2 \leq \|a^{(l)}\|^2 - \lambda \epsilon^2$. Finally, note that there can be at most $\|a^{(0)}\|^2 / (\lambda \epsilon^2)$ such steps, as $\|a^{(l)}\|^2 \geq 0$ always. Thus, after that many steps, we must have $\|P a^{(l)}\| \leq \epsilon$, and hence $\|e^{(l)}\| \leq \|r\| + \|P a^{(l)}\| \leq 2\epsilon$. Since $\|a^{(0)}\|^2 \leq O(n^{2d}/m)$, we obtain a total of $O\!\left(\frac{n^{2d}}{\epsilon^2 \lambda m}\right)$ steps.
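The proof translates directly into the following sketch of the warm-up algorithm: gradient descent on the top layer only, with the bottom layer frozen at its random initialization. Here the exact gradient is replaced by an empirical average over a fixed sample, and the sizes, step count, and target below are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n, m, d, num_samples, steps = 5, 400, 3, 2000, 2000

def sample_C(radius, size):
    return radius * np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=size))

def phi(z):
    return sum(z ** k / math.factorial(k) for k in range(d + 1))

W = sample_C(1.0 / np.sqrt(n), size=(m, n))    # frozen random bottom layer
X = sample_C(1.0, size=(num_samples, n))
f = X[:, 0] * X[:, 1] * X[:, 2]                # degree-3 target of unit norm

P = phi(X @ W.T)                               # P[s, i] = p_i(x_s) = phi_{w_i}(x_s)
alpha = np.zeros(m, dtype=complex)             # alpha = 0 initialization, as in Theorem 4.1
lam = 0.5 / m                                  # learning rate below 1/m
for _ in range(steps):
    e = f - P @ alpha                          # current error e = f - g on the sample
    alpha += lam * (P.conj().T @ e) / num_samples   # alpha_i += lam * <e, p_i> (empirical)
final_error = np.sqrt(np.mean(np.abs(f - P @ alpha) ** 2))
print("error after top-layer training:", final_error)
```

With these assumed settings the error should drop well below its initial value of 1; Theorem 4.2 asserts that the same conclusion survives when the bottom layer is trained simultaneously.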
We now continue to our main result, which shows that gradient descent on the entire network converges as well, given enough hidden units and a small enough learning rate. The proof of the theorem appears in the supplementary material.

Theorem 4.2 (General gradient descent). Fix a target error $\epsilon > 0$ and degree $d \geq 1$. Suppose the weights $\alpha$ are initialized to zero and the $w_i$'s are random from $C(1/\sqrt{n})^n$. Assume the number of hidden units is $m = \Omega(n^{6d}/\epsilon^3)$ and the learning rate is $\lambda \leq 1/(4m)$. Then, given a degree-$d$ polynomial $f(x)$ of unit norm, the gradient descent converges to a net which approximates $f$ up to error $\epsilon$. The number of steps required is $O\!\left(\frac{n^{2d}}{\lambda \epsilon^2 m}\right)$, and the number of samples required is $M = m^{O(1)}$.

5. Random Perturbation at Local Minima

The results of the previous section show that for a sufficiently large neural network, with high probability over the random initialization of the weights, gradient descent will learn a degree-$d$ polynomial. In this section, we prove a conceptually compelling, though incomparable, result: for sufficiently large networks, in a large region of the parameter space, while there may exist local minima, there are no robust local minima. That is, for any point in the specified parameter space, as long as the error is not vanishingly small, with constant probability a random perturbation reduces the error by at least a $1/n^{O(d)}$ factor (see Theorem 5.1). Because of the conditions on the parameters, we cannot use this theorem to conclude that gradient descent will always converge to the global optimum from any initialization. Nevertheless, this provides a rigorous explanation for why local optima may not be so damaging for neural networks in practice. Furthermore, this perspective may prove useful for going beyond the results of the previous section, for example for addressing the harder questions posed in Section 6. We stress that this result relies crucially on the fact that we allow complex-valued weights; in fact, we also show that such a result is not true when the weights are real-valued.

We now state the main result of this section. Define the fourth norm $\|\alpha\|_4 = (\sum_i |\alpha_i|^4)^{1/4}$.

Theorem 5.1. There exist constants $c_1, c_2, c_3, c_4 > 0$ such that the following holds for a truncated neural network $g = \sum_i \alpha_i \phi^d_{w_i}$ with $m = \Omega(n^{c_1 d})$, $\|w_i\|_\infty = O(\log n)$, and $\|\alpha\|_4 = O(\|\alpha\| / n^{c_2 d})$: if $\|e\|_D = \Omega(\|\alpha\|\, n^{c_3 d})$, then perturbing each coordinate of each $w_i$ by an independent draw from $C(1/\sqrt{n})$ reduces the total error by a factor of at least $1 + 1/n^{c_4 d}$, with constant probability.

Note that, by Theorem 3.1, when $m$ is large enough, $m = n^{\Omega(d)}$, the conditions in the theorem can all be met.

We first sketch the proof idea. Under a given distribution $D$, for a target function $f$ and a neural network $g$, consider the error $\|e\|_D^2 = \|g - f\|_D^2$. For a local perturbation $g + \Delta g$ of $g$,
$$\|g + \Delta g - f\|_D^2 = \|e\|_D^2 + 2\,\mathrm{Re}(\langle \Delta g, e \rangle_D) + \|\Delta g\|_D^2.$$
Hence the change in error is $\Delta\|e\|_D^2 = 2\,\mathrm{Re}(\langle \Delta g, e \rangle_D) + \|\Delta g\|_D^2$. We shall show that, with constant probability, the linear term is negative and overwhelms the quadratic term if the perturbation is sufficiently small. The proof consists of two steps. First we consider a single hidden unit and show that a local perturbation can create non-negligible correlation with any bounded-degree polynomial. Secondly, we show by an anti-concentration inequality (this is where the condition on $\|\alpha\|_4$ is needed) that, when we perturb many hidden units independently, the aggregated correlation is still large and, when the number of hidden units is large enough, exceeds the quadratic term, which can be bounded by a standard concentration bound.
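The theorem suggests a simple escape step when gradient descent appears stuck, sketched below: perturb every bottom-layer weight vector by an independent draw from $C(1/\sqrt{n})^n$ and keep the move if the (estimated) error decreases. The truncated activation is from the theorem's setting; the network sizes, the current weights, and the accept/reject loop are illustrative assumptions, with the empirical error standing in for $\|\cdot\|_D$.

```python
import math
import numpy as np

rng = np.random.default_rng(4)
n, m, d, num_samples = 5, 300, 3, 2000

def sample_C(radius, size):
    return radius * np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=size))

def phi(z):                                    # truncated activation phi_d
    return sum(z ** k / math.factorial(k) for k in range(d + 1))

def error(W, alpha, X, f):                     # Monte Carlo estimate of ||g - f||_D
    g = phi(X @ W.T) @ alpha
    return np.sqrt(np.mean(np.abs(g - f) ** 2))

X = sample_C(1.0, size=(num_samples, n))
f = X[:, 0] * X[:, 1]                          # target polynomial
W = sample_C(1.0 / np.sqrt(n), size=(m, n))    # current bottom layer
alpha = sample_C(1.0 / m, size=m)              # some current top layer (illustrative)

for attempt in range(20):
    delta = sample_C(1.0 / np.sqrt(n), size=(m, n))    # one complex perturbation per unit
    if error(W + delta, alpha, X, f) < error(W, alpha, X, f):
        W = W + delta                                  # accept: the error decreased
print("error after perturbation attempts:", error(W, alpha, X, f))
```

Theorem 5.1 says that, under its conditions, each such attempt succeeds with constant probability, so a handful of attempts suffices in expectation; Section 5.4 shows the analogous statement fails for purely real perturbations.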
Our claim applies when the error function $e = g - f$ has bounded degree $d$, which is the reason the result applies to networks with a truncated activation function $\phi_d$, where $\phi_d(z) = \sum_{0 \leq j \leq d} a_j z^j$. For simplicity of notation, we will simply write it as $\phi(z)$. Here it is important that $w$ is complex. The distribution $D$ on $x$ can be over the reals; for example, $D$ can be the standard Gaussian distribution $N(1)^n$ in $\mathbb{R}^n$ or the uniform distribution $U(1)^n$ over $[-1, 1]^n$. We first show our statement for $D = C(1)^n$ and then extend it to the case $D = N(1)^n$ or $U(1)^n$. We will only sketch the main steps of the proofs in this section; the missing proofs are in the supplementary material.

5.1. Random perturbation of one hidden unit

We first show that, for a single hidden unit, a random perturbation creates large correlation with any bounded-degree polynomial. Recall that $\phi_w(x) = \sum_J a_J w^J x^J$ and $a(d) = \min_{|J| \leq d} |a_J|$. For $x \in \mathbb{C}^n$, we define $\|x\|_\infty = \max_j |x_j|$. We denote by $\Delta_\delta \phi_w = \phi_{w+\delta} - \phi_w$ the perturbation of a hidden unit $\phi_w$ by $\delta$. We have:

Theorem 5.2. For any $x \in \mathbb{C}^n$, $\mathbb{E}_{\delta \sim C(r)^n}[\Delta_\delta \phi_w(x)] = 0$. Moreover, for any $\eta$ such that $\deg(\eta) \leq d$ and $\|\eta\|_D \geq 1$, and for any $0 < r \leq 1$ and $\|w\|_\infty \leq L$,
$$\mathbb{E}_{\delta \sim C(r)^n}\left[\mathrm{Re}(\langle \Delta_\delta \phi_w, \eta \rangle_D)^2\right] = \Omega\!\left(\frac{r^{2d}\, a(d)^2}{n^d (L+1)^{2d}}\right).$$

Proof sketch. Clearly, for any $x$, $\Delta_\delta \phi_w(x)$ can be written as a polynomial in $\delta$ without a constant term. By Fact 2.1, $\mathbb{E}_{\delta \sim C(r)^n}[\Delta_\delta \phi_w(x)] = 0$. This implies that $\mathbb{E}_{\delta \sim C(r)^n}[\langle \Delta_\delta \phi_w, \eta \rangle_D] = 0$. Write $B(w) = \langle \phi_w, \eta \rangle_D$. As $\eta$ is a polynomial of degree at most $d$, so is $B(w)$. By the above, we have
$$\mathbb{E}_{\delta \sim C(r)^n}[B(w + \delta)] = B(w). \qquad (5.1)$$
We lower bound $\mathbb{E}_{\delta \sim C(r)^n}[|B(w + \delta) - B(w)|^2]$ as follows. We first treat the case $w = 0$, and then apply the shifting lemma (Lemma 5.4) to complete the proof.

Lemma 5.3. For $0 \leq r \leq 1$, $\mathbb{E}_{\delta \sim C(r)^n}[|B(\delta) - B(0)|^2] \geq r^{2d}\, a(d)^2$.

Lemma 5.4. Suppose that $f$ is a degree-$d$ polynomial in $n$ variables. Let $v = (v_1, \ldots, v_n)$ with $\|v\|_\infty \leq L$, and let $f_v(x) = f(v + x)$. Then $\|f_v\|^2 \leq n^d (L + 1)^{2d} \|f\|^2$.

By the above two lemmas, we can show that $\mathbb{E}_{\delta \sim C(r)^n}[|\langle \Delta_\delta \phi_w, \eta \rangle_D|^2] = \Omega\!\left(r^{2d} a(d)^2 / (n^d (L + 1)^{2d})\right)$, and then transfer this bound to $\mathrm{Re}(\langle \Delta_\delta \phi_w, \eta \rangle_D)^2$, which completes the proof.

We note that the theorem above holds for a large range of $r$ and $w$; to suppress the second-order term, however, we will only need it in the range $r = O(1/\sqrt{n})$ and $\|w\|_\infty = O(\log n)$.

5.2. Random perturbation of many hidden units

Now consider a neural network $g(x) = \sum_{i=1}^m \alpha_i \phi_{w_i}(x)$, where each $\|w_i\|_\infty = O(\log n)$. Let $g'(x) = \sum_{i=1}^m \alpha_i \phi_{w_i + \delta_i}(x)$, where each $\delta_i$ is drawn i.i.d. from $C(1/\sqrt{n})^n$. Then
$$\|e'\|_D^2 - \|e\|_D^2 = \|(g' - g) + e\|_D^2 - \|e\|_D^2 = \|g' - g\|_D^2 + 2\,\mathrm{Re}(\langle g' - g, e \rangle_D). \qquad (5.2)$$
First consider $\|g' - g\|_D^2 = \|\sum_{i=1}^m \alpha_i \Delta_{\delta_i} \phi_{w_i}\|_D^2$. We can view $\Delta_\delta \phi_w$ as a vector in the function space, so each $\Delta_{\delta_i} \phi_{w_i}$ is a random vector. We have shown that $\mathbb{E}_{\delta_i}[\Delta_{\delta_i} \phi_{w_i}] = 0$, and that for $\|w_i\|_\infty = O(\log n)$ and $r = O(1/\sqrt{n})$, $\mathbb{E}_{\delta_i \sim C(r)^n}\|\Delta_{\delta_i} \phi_{w_i}\|_D^2 = n^{O(1)}$. Since the $\Delta_{\delta_i} \phi_{w_i}$'s are independent, by a standard concentration bound, with high probability
$$\|g' - g\|_D^2 = O(n^{O(1)}\, \|\alpha\|^2). \qquad (5.3)$$
For the linear term $\mathrm{Re}(\langle g' - g, e \rangle_D)$, using Theorem 5.2 and an anti-concentration inequality, we can show that when $\|\alpha\|_4 \leq \|\alpha\| / n^{cd}$, with constant probability (say $1/4$),
$$\mathrm{Re}(\langle g' - g, e \rangle_D) \leq -\Omega(\|\alpha\|\, \|e\|_D / n^{O(d)}). \qquad (5.4)$$
Combining (5.2), (5.3), and (5.4), whenever $\|\alpha\|_4 \leq \|\alpha\| / n^{cd}$ and $\|\alpha\| \leq \|e\|_D / n^{O(d)}$, we have $\|e\|_D^2 - \|e'\|_D^2 \geq \|\alpha\|\, \|e\|_D / n^{O(d)}$ with constant probability. Hence we have proved the main theorem.

5.3. Extension to distributions on reals

The above proof can also be extended to the case where $x$ is not complex but is chosen from the Gaussian distribution $N(1)^n$ in $\mathbb{R}^n$ or the uniform distribution $U(1)^n$ on $[-1, 1]^n$.
This follows from the following observation, which relates the norm of a polynomial under different distributions.

Observation 5.5. Let $P(x)$ be a degree-$d$ polynomial. Then $\|P\|_D = \Omega(1/d^{d/2})\, \|P\|_{C(1)^n}$ and $\|P\|_D = O(d^{d/2}\, \|P\|_{C(1)^n})$, where $D = N(1)^n$ or $U(1)^n$.

The observation implies that if we replace $C(1)^n$ by $N(1)^n$ or $U(1)^n$, the bound in Theorem 5.2 is only affected by a factor depending on $d$ alone ($d^d$ or $2^d$). Since we assume $d$ to be constant, we have:

Corollary 5.6. The statement of Theorem 5.1 also holds when $D = N(1)^n$ or $D = U(1)^n$.

5.4. Robust local minima for real weights

The perturbation theorem (Theorem 5.1) uses a random perturbation in the complex plane to escape a local minimum. It is natural to ask whether a real-valued perturbation would suffice instead. We show that this is not the case: there are examples where a real-valued perturbation does not improve the error. This suggests that using complex perturbations may be useful.

Lemma 5.7. Consider a neural network with one layer of hidden units, real weights, and activation function $\phi(z) = \sum_{l=0}^d a_l z^l$ with $a_l \geq 0$; the input distribution is $x \sim U(1)^n$. There exists a set of network parameters where the gradient is zero and a random perturbation of the weights $\{w_i\}_i$ moves away from the target function, i.e., $\|e'\|_D > \|e\|_D$ with high probability.

Proof. We give a construction of a set of parameters which is a local minimum for the gradient descent and for which a real-valued random perturbation of the weights is expected to increase the error. This point is where all hidden units have identical weights $w_i = 0$ and $\alpha_i = 1/m$ (for $m$ hidden units), so that $\sum_i \alpha_i = 1$. We show why a local perturbation does not reduce the error, except with a very small probability. Let $\delta_i$ be the perturbation of weight $w_i$; for concreteness, suppose each coordinate of the perturbation is uniform in $[-\lambda, \lambda]$ for some $\lambda \leq 1/m$. Let $\Delta g = g' - g$ denote the change in the output function of the neural net. We argue that $\mathbb{E}_{\delta_i}[\Delta g]$ is non-zero. Note that the change $\Delta g$ can be written as a polynomial in the changes of the weights $\delta_{ij}$. Fix one particular hidden unit $i$ and a fixed $j \in [n]$, and consider the term corresponding to the second Legendre polynomial in $x_j$, $L_2(x_j)$ (remember that the Legendre polynomials are the basis for our input distribution, so we are considering one coordinate in the polynomial basis). We claim that its coefficient is positive: in fact, it is a (positive) combination of even powers of the $\delta_{i,j}$'s for all $j \in [n]$. In particular, in a term $a_l (\delta_i \cdot x)^l$, we get a contribution to $L_2(x_j)$ only from terms of the form $\delta_i^J x^J$ where the vector $J$ is even (any other term has correlation $0$ with $L_2(x_j)$). We can now choose $e = g - f$ so that $\langle e, \Delta g \rangle$ is positive, by choosing $f(x) = g(0) - \sum_j L_2(x_j)$. Then the change in $\langle e, e \rangle$ is
$$\Delta \langle e, e \rangle = \langle e', e' \rangle - \langle e, e \rangle = 2 \langle e, \Delta g \rangle + \langle \Delta g, \Delta g \rangle.$$
The error strictly increases with probability $1$. It is also clear why the gradient is zero: $\frac{\partial g}{\partial w_{ij}}$, when all $w_i = 0$, consists only of terms linear in $x$, whereas $e = g - f$ has no constant or linear components (in the Legendre basis). We remark that the above holds even if we also perform a small random perturbation on the weights $\alpha_i$ (it suffices that the $\alpha_i$ and their perturbed versions remain positive).

6. Learning sparse polynomials

In this section we study whether smaller neural networks suffice for learning sparse polynomials (polynomials containing few monomials). As an intermediate step towards this goal, we also consider the setting where the polynomial depends on only a small subset of the $n$ variables (and hence is also sparse).
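As a concrete reference point for the experiments discussed in Section 6.1 below, here is a minimal version of the $n$-sparse quadratic setup of Figure 1. The $5n$ hidden units and the random choice of $n$ monomials follow the figure caption; the tanh activation with biases, Gaussian inputs, target normalization, and plain full-batch gradient descent are assumptions made for this sketch, since the paper does not spell out these experimental details.

```python
import numpy as np

rng = np.random.default_rng(5)
n, num_samples, steps, lr = 10, 4000, 5000, 0.02
m = 5 * n                                           # hidden units, as in Figure 1

# n-sparse quadratic target: n random monomials x_i * x_j with +/-1 coefficients.
idx = rng.integers(0, n, size=(n, 2))
coef = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(num_samples, n))
y = (coef * X[:, idx[:, 0]] * X[:, idx[:, 1]]).sum(axis=1)
y /= np.std(y)                                      # normalize the target scale (illustrative choice)

W = rng.normal(scale=1.0 / np.sqrt(n), size=(m, n)) # bottom layer
b = rng.normal(size=m)                              # biases (so tanh units can express even terms)
a = np.zeros(m)                                     # top layer
for step in range(steps):
    H = np.tanh(X @ W.T + b)                        # hidden activations, shape (num_samples, m)
    err = H @ a - y                                 # residuals g(x) - f(x)
    back = err[:, None] * (1.0 - H ** 2) * a        # backpropagated signal per hidden unit
    a -= lr * (H.T @ err) / num_samples             # gradient step on the top layer
    W -= lr * (back.T @ X) / num_samples            # gradient step on the bottom layer
    b -= lr * back.mean(axis=0)                     # gradient step on the biases
    if step % 1000 == 0:
        print(step, float(np.mean(err ** 2)))       # training error, cf. Figure 1
```

With these settings the training error should decrease over the run, qualitatively mirroring Figure 1; the exact curve depends on the assumed hyperparameters.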
These questions can be viewed as clean and potentially theoretically tractable special cases of the general question of understanding the relation between the representation complexity and the learning complexity of neural networks for a given class of functions.

6.1. Learning n-sparse polynomials

Can a neural network with $O(n)$ hidden units learn a quadratic or cubic polynomial that has $n$ monomials? We provide strong empirical evidence (see Figure 1) suggesting that, for the case of $n$-sparse polynomials over $n$ variables, a neural network with $O(n)$ hidden units can learn the function. We train the net using $5n$ hidden units while varying $n$ through the values 10, 20, 40, and 80. The polynomial is constructed using $n$ randomly chosen monomials. The plots show that the training error drops significantly after a reasonable number of iterations, which depends on $n$.

[Figure 1: training error versus iteration for learning $n$-sparse quadratic (top) and cubic (bottom) polynomials over $n$ variables, for $n = 10, 20, 40, 80$. In these plots the neural networks had $5n$ hidden units, and the polynomials were chosen by selecting each of the $n$ monomials uniformly at random from the $O(n^2)$ (in the quadratic case) or $O(n^3)$ (in the cubic case) possible monomials.]

6.2. Learning polynomials over few variables

As an intermediate step towards the sparse polynomial case, we investigate whether a small neural network suffices to learn a sparse polynomial that also depends on only $k$ variables. Here, a simpler goal may be to prove that gradient descent learns the polynomial using only $k^{O(d)}$ hidden units instead of $n^{O(d)}$. We are unable to prove this, but we provide some evidence in this direction. First we show (Lemma 6.1) that, assuming $x \sim C(1)^n$, at the termination of gradient descent the final function output by the net depends only on the $k$ relevant variables. We show a similar result for the (more realistic) case $x \sim N(1)^n$, for a specific activation function $\psi$ built using Hermite polynomials. We will use $H_J(x)$ to denote the Hermite polynomial over $x$ corresponding to a vector $J$ of degrees in the $n$ variables of $x$. (Some details are deferred to the full version.)

Lemma 6.1. If $x \sim C(1)^n$ and the target function $f$ does not depend on a variable $x_i$, then whenever gradient descent converges to a point with zero gradient, the output $g$ does not depend on $x_i$. This also holds for the input distribution $x \sim N(1)^n$, provided one uses the special activation function $\psi_w(x) = \sum_J a_J H_J(x) w^J$.

Proof. The main idea is that if a variable, say $x_1$, is not used in $f$, then the error $e$ can be written as a sum of two polynomials, one that does not involve $x_1$ and one that has $x_1$ in every monomial, and these two are orthogonal. Let $e = f - g = f - \sum_i \alpha_i \psi_{w_i}(x)$. By expanding the polynomial $\psi$ and gathering the terms that depend on $x_1$ and the others, we get $e = q(x_2, x_3, \ldots, x_n) + h(x)$, where
$$h(x) = \sum_J c_J \Big(\sum_i \alpha_i w_i^J\Big) x^J,$$
the sum ranges over $J$ with non-zero degree in $x_1$ (that is, $J_1$ is non-zero), and $q$ depends neither on $x_1$ nor on the weights $w_{i,1}$. All we need to show is that $\sum_i |\langle \frac{\partial e}{\partial w_{i,1}}, e \rangle|^2$ is non-zero. The partial derivative of the error with respect to $w_{i,1}$ is
$$\frac{\partial e}{\partial w_{i,1}} = \frac{\partial}{\partial w_{i,1}}\Big(\alpha_i \sum_J c_J w_i^J x^J\Big) = \alpha_i \sum_J c_J J_1 w_i^J x^J / w_{i,1}.$$
Therefore
$$\sum_i w_{i,1} \frac{\partial e}{\partial w_{i,1}} = \sum_J J_1 c_J \Big(\sum_i \alpha_i w_i^J\Big) x^J,$$
and hence, since $q$ contains no monomial involving $x_1$,
$$\Big\langle \sum_i w_{i,1} \frac{\partial e}{\partial w_{i,1}},\; e \Big\rangle = \Big\langle \sum_J J_1 c_J \Big(\sum_i \alpha_i w_i^J\Big) x^J,\; \sum_J c_J \Big(\sum_i \alpha_i w_i^J\Big) x^J \Big\rangle = \sum_J J_1 \Big| c_J \sum_i \alpha_i w_i^J \Big|^2,$$
which must be non-zero, as $h = \sum_J c_J (\sum_i \alpha_i w_i^J)\, x^J$ is non-zero (and each $J_1$ is non-zero).
This means that $\sum_i w_{i,1} \langle \frac{\partial e}{\partial w_{i,1}}, e \rangle$ is non-zero. Thus at least one of the $\langle \frac{\partial e}{\partial w_{i,1}}, e \rangle$ is non-zero, which means that the gradient descent has not converged yet. A similar proof works for the case when $x$ is real and we instead use the modified activation $\psi$ in which each monomial is replaced by a Hermite polynomial; the proof is based on the orthonormality of these polynomials over $N(1)^n$.

Further, we point out that neural networks are able to learn polynomials based on their best possible sparsity under any orthonormal transform; i.e., the notion of sparsity is independent of the chosen coordinate system. This is because the neural net works with dot products of the input point.

Observation 6.2. If a neural network can learn a $k$-sparse polynomial in time $T(k, d, n)$, then it can also learn any polynomial that is $k$-sparse under any orthonormal transform in the same time complexity.

Proof. This follows from the rotational invariance of the gradient descent process when $g$ is a function of the inner products $w_i \cdot x$ over the different hidden units $i$: rotating the coordinate system does not change the gradient descent process.

In order to show that gradient descent succeeds with $k^{O(d)}$ units, we need to assume a high-rank condition (a variant of the condition in Theorem 3.1) for an analysis similar to that of Section 4 to hold. We are currently unable to prove this high-rank condition and leave it as an open question.

References

Alekhnovich, Michael. More on average case vs approximation complexity. In Proceedings of the Symposium on Foundations of Computer Science (FOCS), 2003.

Andoni, Alexandr, Panigrahy, Rina, Valiant, Gregory, and Zhang, Li. Learning sparse polynomial functions. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2014.

Applebaum, Benny, Ishai, Yuval, and Kushilevitz, Eyal. Cryptography in NC^0. SIAM Journal on Computing, 36(4):845-888, 2006.

Applebaum, Benny, Barak, Boaz, and Wigderson, Avi. Public-key cryptography from different assumptions. In Proceedings of the Symposium on Theory of Computing (STOC), 2010.

Barron, Andrew R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993.

Barron, Andrew R. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115-133, 1994.

Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In NIPS, 2007.

Bengio, Yoshua. Deep learning of representations: Looking forward. arXiv preprint arXiv:1305.0445, 2013.

Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. In ICML, 2013.

Hinton, G. E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

Peikert, Chris. Public-key cryptosystems from the worst-case shortest vector problem. In Proceedings of the Symposium on Theory of Computing (STOC), 2009.

Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. Efficient learning of sparse representations with an energy-based model. In NIPS, 2006.

Regev, Oded. On lattices, learning with errors, random linear codes, and cryptography. In STOC '05: Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pp. 84-93, New York, NY, USA, 2005. ACM. ISBN 1-58113-960-8.
doi: http://doi.acm.org/10.1145/1060590.1060603.

Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Dynamics of learning in deep linear neural networks. NIPS Workshop on Deep Learning, 2013.

Wan, Li, Zeiler, Matthew, Zhang, Sixin, LeCun, Yann, and Fergus, Rob. Regularization of neural networks using DropConnect. In ICML, 2013.