On the universality of deep learning

Emmanuel Abbe, Mathematics Institute, EPFL, Lausanne, 1005, Switzerland
Colin Sandon, Department of Mathematics, MIT, Cambridge, MA 02139

This paper shows that deep learning, i.e., neural networks trained by SGD, can learn in polytime any function class that can be learned in polytime by some algorithm, including parities. This universal result is further shown to be robust, i.e., it holds under possibly poly-noise on the gradients, which gives a separation between deep learning and statistical query algorithms, as the latter are not comparably universal due to cases like parities. This also shows that SGD-based deep learning does not suffer from the limitations of the perceptron discussed by Minsky-Papert in 1969. The paper further complements this result with a lower bound on the generalization error of descent algorithms, which implies in particular that the robust universality breaks down if the gradients are averaged over large enough batches of samples, as in full GD, rather than over fewer samples, as in SGD.

1 Introduction

It is known that the class of neural networks (NNs) with polynomial network size can express any function that can be implemented in a given polynomial time [Par94, Sip06], and that their sample complexity scales polynomially with the network size [AB09]. Thus NNs have favorable approximation and estimation errors. However, there is no known efficient training algorithm for NNs with general and provable guarantees; in particular, it is NP-hard to implement the ERM rule [KS09, DSS16]. The success behind deep learning is to train deep NNs with stochastic gradient descent (SGD) or the like, which gives record performances in various applications [KSH12, HDY+12, LBBH98, LBH15, GBC16].
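The expressivity claim above (any polytime-computable function has a poly-size net) can be made concrete with threshold gates: each gate of a Boolean circuit is a single neuron with a hard-threshold non-linearity, so a poly-size circuit becomes a poly-size net. A minimal sketch, with all function names our own illustration and not the paper's construction:

```python
# Illustrative sketch: Boolean gates as single threshold units. A polytime-
# computable function is a polynomial-size circuit of such gates, hence a
# polynomial-size neural net with a threshold non-linearity.
def threshold(weights, bias, x):
    """One neuron: fire iff the weighted sum plus bias is non-negative."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s >= 0 else 0

def AND(a, b):   # fires only when a + b >= 2
    return threshold([1, 1], -2, [a, b])

def OR(a, b):    # fires when a + b >= 1
    return threshold([1, 1], -1, [a, b])

def NOT(a):      # fires when -a >= 0, i.e., a == 0
    return threshold([-1], 0, [a])

def XOR(a, b):   # a small two-layer circuit: (a OR b) AND NOT(a AND b)
    return AND(OR(a, b), NOT(AND(a, b)))
```

Chaining XOR gadgets in a balanced binary tree computes the parity of n bits in O(log n) such layers, matching the depth discussed later in the paper.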
It is thus natural to ask whether SGD can also control efficiently the third pillar of statistical learning, i.e., the optimization error, turning deep learning into a universal learning paradigm that can learn efficiently any efficiently learnable class; see [SSBD14] for further discussions of this question. This paper answers this question in the affirmative, with the following contributions and implications:

1. It is shown¹ that poly-size neural nets trained by SGD with poly-many steps can learn any function class that is learnable by an algorithm that runs in polytime and with poly-many samples; see Theorem 1. This part is resolved using a net initialization that is implemented in polytime (and not dependent on the function to be learned nor on the data) and that emulates with SGD any efficient learning algorithm. This shows in particular that SGD-based deep learning is P-complete: any algorithm in P can be reduced to training with SGD a neural net initialized in polytime with a proper non-linearity and evaluating the net (see Remark 2).

¹This paper discusses results that are asymptotic in the dimension n of the data, i.e., the number of inputs to the neural net. The term "poly" is used to emphasize that the considered quantities scale polynomially with n, i.e., n^c for a constant c. This can be polynomially large if c > 0 or polynomially small if c < 0. In some parts, we abuse terminology and simply say poly/polynomially when it is clear from the context which direction is considered; e.g., only a noise variance that is polynomially small is interesting in our context, not a noise variance that is polynomially large (which would be a huge noise).

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

2. We further show that this positive result is achieved with robustness: polynomial noise can be added to the gradients and weights can be of polynomial precision, and the result still holds; see Theorem 2.
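The robustness statement in item 2 can be sketched as a single update rule: a usual SGD step, followed by a per-weight perturbation bounded by an inverse-polynomial amount. The linear model and squared loss below are illustrative stand-ins of our choosing, not the paper's emulation construction; the noise bound 1/(n² t_n) mirrors the one appearing in Theorem 2.

```python
import random

def grad_sq(w, x, y):
    """Gradient of the squared loss (sum_i w_i x_i - y)^2 with respect to w."""
    p = sum(wi * xi for wi, xi in zip(w, x))
    return [2 * (p - y) * xi for xi in x]

def perturbed_sgd_step(w, x, y, lr, delta, grad_loss=grad_sq):
    """One SGD step on a single sample, then bounded precision noise per weight."""
    g = grad_loss(w, x, y)
    return [wi - lr * gi + random.uniform(-delta, delta)
            for wi, gi in zip(w, g)]

# Inverse-polynomial precision noise, as in the robustness result:
n, t_n = 10, 1000
delta = 1.0 / (n ** 2 * t_n)
w = perturbed_sgd_step([0.0] * n, [1.0] * n, 1.0, lr=0.01, delta=delta)
```

The point of the theorem is that the emulation survives such perturbations, i.e., it is not an artifact of infinite-precision arithmetic.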
Therefore, in a learning-theoretic sense, deep learning gives a universal learning paradigm: approximation, estimation and optimization errors are all controllable with polynomial parameters, and this is not degenerate as it can be implemented with polynomial precision. This also creates a separation between deep learning and statistical query (SQ) algorithms, as the latter are not comparably universal due to cases like parities [Kea98].

3. Parities have been known to be challenging since the work of Minsky-Papert on the perceptron [MP87], and our positive result indeed requires more than a single hidden layer to succeed, i.e., O(log n) layers² (see Example 1 in the Appendix). In particular, our universality result together with [Raz16] implies that there exist function classes that require large enough nets to be learned with SGD: we know from [Raz16] that a net with o(n²/log n) edges of polynomial precision cannot learn parities with poly-many samples, and thus with SGD, in polytime (even though one can represent parities with such a size and depth 2); our result now shows that a net of size n² and O(log n) layers can learn parities with SGD in polytime.

4. A lower bound is derived for descent algorithms on neural nets that shows that learning is impossible with polynomial precision if the junk-flow does not overcome the cross-predictability; see the definitions in Section 2 and Theorem 3. The cross-predictability corresponds to an inverse average-case notion of statistical dimension that is classical for SQ algorithms [BFJ+94, Kea98, BKW03, Fel16], and the junk-flow is a quantity that is specific to descent algorithms. This shows in particular that the robust universality does not hold when replacing the stochastic gradients with perfect gradients on the entire population distribution or with large enough polynomial batches of fresh samples, in agreement with the results from SQ algorithms [Kea98, BFJ+94].
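The intuition behind item 4 can be illustrated numerically: for a random parity label, the gradient seen by any fixed coordinate has expectation zero, so averaging it over a large batch concentrates around zero and wipes out the signal that single-sample (SGD) updates retain. The model and correlation-style loss below are illustrative stand-ins of ours, not the paper's junk-flow argument.

```python
import random

def avg_gradient_coord(S, m, coord, rng):
    """Average over m samples of the gradient -y * x_coord for the hidden
    parity p_S, under uniform inputs in {+1,-1}^n (correlation loss)."""
    total = 0.0
    for _ in range(m):
        x = [rng.choice([+1, -1]) for _ in range(max(S) + 1)]
        y = 1
        for i in S:
            y *= x[i]                 # label p_S(x) = prod_{i in S} x_i
        total += -y * x[coord]
    return total / m

rng = random.Random(0)
S = {0, 2, 3}                          # hidden parity support
g_single = avg_gradient_coord(S, 1, 1, rng)       # one sample: magnitude 1
g_batch = avg_gradient_coord(S, 10000, 1, rng)    # large batch: near 0
```

A single-sample gradient always has magnitude 1 here, while the batch average has standard deviation 1/sqrt(m); with perfect (population) gradients the signal vanishes entirely, which is the regime where the lower bound bites.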
Therefore, some small amount of stochasticity³ is needed to obtain the robust universality in our setting. The junk-flow also gives a measure for tackling lower bounds for gradient descent algorithms more specifically. In a practical setting, there may be no reason to use our SGD replacement of a general learning algorithm, but this universality result emphasizes the breadth of deep learning in the computational learning context and the fact that negative results about deep learning cannot be obtained without further constraints. A natural direction to pursue is typical architectures and initializations.

1.1 Problem formulations and learning objectives

We focus on Boolean functions to simplify the setting. Since it is known that any Boolean function that can be computed in time O(T(n)) can also be expressed by a neural network of size O(T(n)²) [Par94, Sip06], it is not meaningful to ask whether any such function f0 can be learned with a poly-size NN and a descent algorithm that can depend on f0; one can simply pre-set the net to express f0. Two more meaningful questions are: (1) Can one learn a given function with an agnostic/random initialization? (2) Can one learn an unknown function from a class or distribution with a proper initialization? For the second question, one is not given a specific function f0 but a class of functions, or more generally, a distribution on functions. Therefore, one can no longer preset the net as desired in an obvious way. We focus here mainly on question 2, which is classical in statistical learning [SSBD14], and which gives a more general framework than restricting the initialization to be random. Moreover, in the case of symmetric function distributions, such as the parities discussed below, failure at 2 implies failure at 1.
Namely, if we cannot learn a parity function for a random selection of the support S (see the definitions below), we cannot learn any given parity function on a typical support S0 with a random initialization of the net, because the latter is symmetrical. We thus have the following setting:

²One can reduce the number of layers by using threshold gates of arbitrary fan-in in the computation component; see Section 4.
³The stochasticity of SGD has also been advocated in contexts such as stability, implicit regularization, or avoiding bad critical points [HRS16, ZBH+16, PP17, KLY18].

Let D = {+1, −1} and X = D^n be the data domain and let Y = {+1, −1} be the label domain. We work with binary vectors and binary labels for convenience (several of the results extend beyond this setting with appropriate reformulations of the definitions). Let PX be a probability distribution on the data domain X and PF be a probability distribution on Y^X (the set of functions from X to Y). We also assume for convenience that these distributions lead to balanced classes, i.e., that P(F(X) = 1) = 1/2 + o_n(1) when (X, F) ∼ PX × PF (non-balanced cases require adjustments of the definitions). Our goal is to learn a function F drawn under PF by observing labelled examples (X, Y) with X ∼ PX, Y = F(X). In order to learn F, we can train our algorithm on labelled examples with a descent algorithm starting with an initialization f^(0) and running for a number of steps T = T(n) (other parameters of the algorithm, such as the learning rate, are also specified). In the case of perfect GD, each step accesses the full distribution of labelled examples, while for SGD, it only accesses a single labelled example per step (see the definitions below). In all cases, after the training with (f^(0), T), the algorithm produces an estimator F̂_{f^(0),T} of F.
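The sampling model of this setting, instantiated for the parity example formalized below (PX uniform on {+1,−1}^n, PF uniform over parity functions p_S), can be sketched as follows; the function names are our own illustration:

```python
import random

def draw_parity_problem(n, num_samples):
    """Nature draws a hidden support S uniformly among subsets of [n],
    then emits labelled examples (X, p_S(X)) with X uniform on {+1,-1}^n."""
    S = {i for i in range(n) if random.random() < 0.5}   # hidden support
    samples = []
    for _ in range(num_samples):
        x = [random.choice([+1, -1]) for _ in range(n)]
        y = 1
        for i in S:
            y *= x[i]                 # p_S(x) = prod_{i in S} x_i
        samples.append((x, y))
    return S, samples
```

The learner sees only `samples` (and knows the class of parities), not `S`; it succeeds if it predicts p_S on a fresh X with accuracy better than random guessing.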
We say that an algorithm achieves an accuracy of α in T time steps for the considered (PX, PF) if a net with initialization f^(0) can be constructed such that

P(F̂_{f^(0),T}(X) = F(X)) ≥ α,  (1)

where the above probability is over (X, F) ∼ (PX × PF) and any randomness potentially used by the algorithm. We refer to typical-weak learning when α = 1/2 + Ω_n(1), in other words, when we can predict the label of a fresh sample from PX with accuracy strictly better than random guessing. Failing at typical-weak learning implies failing at most other learning requirements, such as PAC learning a class in the case of a uniform distribution on a certain class of functions [BKW03, MRT12]. For our positive results with SGD, we will not only show that one can efficiently typically weakly learn any function distribution that is efficiently typically weakly learnable, but that we can in fact reproduce whatever accuracy an algorithm can achieve for the considered distribution. We also shorten typical-weak learning to simply learning, and talk about learning a function distribution or a distribution when referring to learning a pair (PX, PF).

Example. The problem of learning parities corresponds to PX being uniform on {+1, −1}^n and PF uniform on the set of parity functions P = {p_s : s ⊆ [n]}, where p_s : {+1, −1}^n → {+1, −1} is such that p_s(x) = ∏_{i∈s} x_i. So nature picks S uniformly at random, and with knowledge of P but not S, the problem is to learn which set S was picked from samples (X, p_S(X)).

2.1 Definitions and models

We use a fairly generic notion of neural nets: simply weighted directed acyclic graphs with a special vertex for the output, a special set of vertices for the inputs, and a non-linearity at the other vertices.

Definition 1. A neural net is defined by a pair of a non-linearity function f : R → R and a weighted directed graph G with some special vertices and the following properties.
G does not contain any cycle, and there exists n > 0 such that G has exactly n + 1 vertices that have no edges ending at them, v0, v1, ..., vn. We refer to n as the input size, v0 as the constant vertex, and v1, v2, ..., vn as the input vertices. Further, there exists a vertex v_out such that for any other vertex v′, there is a path from v′ to v_out in G. We also denote by W = w(G) the weights on the edges of G.

We denote by eval_{(f,G)}(x) the evaluation of the neural net (f, G) at an input x (or eval_G(x) if f is implicit). We also use the shortcut notation W(x) for eval_G(x) when G is implicit, with a slight abuse of notation between W(G) and W(X) (the argument in W(·) clarifies the definition). For a loss function L, a target function h, and a net (f, G), the net's loss at a given input x is L(h(x) − eval_{(f,G)}(x)). Note that, as we have defined them, neural nets generally give outputs in R rather than {0, 1}. As such, when talking about whether training a neural net by some method learns Boolean functions, we will implicitly be assuming that the output of the net on the final input is thresholded at some predefined value or the like. None of our results depend on exactly how we deal with this part.

Definition 2. Let n > 0, α ∈ [0, 1], PX be a probability distribution on {0, 1}^n, and PF be a probability distribution on the set of functions from {0, 1}^n to {0, 1}. Also, let X0, X1, ... be independently drawn from PX and F ∼ PF. An algorithm learns (PF, PX) with accuracy α in T time steps if the algorithm is given the value of (Xi, F(Xi)) for each i < T and, when given the value of XT ∼ PX independent of F, it returns YT such that P(F(XT) = YT) ≥ α.

Algorithms such as SGD (or Gaussian elimination from samples) fit under this definition. For SGD, the algorithm starts with an initialization W^(0) of the neural net weights, and updates it sequentially with each sample (Xi, F(Xi)) as W^(i) = g(Xi, F(Xi), W^(i−1)) where

g(Xi, F(Xi), W^(i−1)) = W^(i−1) − γ ∇L(eval_{W^(i−1)}(Xi), F(Xi)), i < T.
It then outputs Y_T = eval_{W^(T−1)}(X_T). For SGD with batch size m and fresh samples, one updates the previous definition with not a single sample at each time step but m i.i.d. samples at each time step, computing the empirical average of the query. The extreme case of perfect GD corresponds to m being infinite, so GD proceeds successively with the following (F, PX)-dependent updates:

W^(i) = E_{X∼PX} g(X, F(X), W^(i−1)), i < T,

for the same function g as in SGD. We also consider a noisy version of the above, to ensure that the algorithm is not succeeding due to infinite precision but with robustness.

2.2 Positive results

Our first result shows that for any distribution that can be learned by some algorithm in polytime, with poly-many samples and with accuracy α, there exists an initialization (which means a neural net architecture with an initial assignment of the weights) that is constructed in polytime and that is agnostic to the function to be learned, such that training this neural net with SGD and possibly poly-noise learns this distribution in poly-many steps with accuracy α − o(1).

Theorem 1. For each n > 0, let PX be a probability measure on {0, 1}^n, and PF be a probability measure on the set of functions from {0, 1}^n to {0, 1}. Next, define α = α_n such that there is some algorithm that takes a polynomial number of samples (Xi, F(Xi)), where the Xi are i.i.d. under PX, runs in polynomial time, and learns (PF, PX) with accuracy α. Then there exist γ = o(1), a polynomial-sized neural net (Gn, φ) constructed in polytime, and a polynomial Tn such that using stochastic gradient descent with learning rate γ to train (Gn, φ) on Tn samples ((Xi, Ri, R′_i), F(Xi)), where⁴ (Xi, Ri, R′_i) ∼ PX × Ber(1/2)², learns (PF, PX) with accuracy α − o(1).

Remark 1.
As a special case, one can construct in polytime a net (f, g) of poly-size such that, for a learning rate γ and an integer T that are at most polynomial, (f, g) trained by SGD with learning rate γ and T time steps learns parities with accuracy 1 − o(1). In other words, random bits are not needed for parities, as parities can be learned with a deterministic algorithm using only samples of the same label without producing bias (see Section 4). The previous theorem also implies that SGD on neural nets can efficiently PAC-learn parities (or any class that is efficiently PAC-learnable).

We now show that the previous result can be extended when sufficiently low amounts of inverse-polynomial noise are added to the weight of each edge in each time step. In particular, the previous theorem is not a degeneracy due to infinite precision.

Theorem 2. For each n > 0, let PX be a probability measure on {0, 1}^n, and PF be a probability measure on the set of functions from {0, 1}^n to {0, 1}. Let tn be polynomial in n. Next, define α = α_n such that there is some algorithm that takes tn samples (Xi, F(Xi)), where the Xi are independently drawn from PX and F ∼ PF, runs in polynomial time, and learns (PF, PX) with accuracy α. Then there exist γ = Ω(1) and a polynomial-sized neural net (Gn, f) such that using perturbed stochastic gradient descent with precision noise⁵ δ ∈ [−1/(n²tn), 1/(n²tn)]^{tn·|E(Gn)|}, learning rate γ, and loss function L(x) = x² to train (Gn, f) on tn samples⁶ ((Xi, Ri), F(Xi)), where (Xi, Ri) ∼ PX × Ber(1/2), learns (PF, PX) with accuracy α − o(1).

Corollary 1. For any c > 0, there exists a universal polytime initialization of a poly-size neural net such that, if samples are produced from a distribution that is learnable with accuracy α by some algorithm working with an upper bound n^c on the number of samples and the time needed per sample, then SGD run in polytime with poly-many samples and possibly inverse-poly noise will succeed in learning the distribution with accuracy α − o_n(1).

⁴We use Ber(1/2) to denote the Bernoulli distribution on {0, 1}.
⁵This means that each weight at each time step of SGD is perturbed by an amount bounded by 1/(n²tn).
⁶Formally, the samples should be converted to ((2Xi − 1, 2Ri − 1), 2F(Xi) − 1), i.e., valued in {−1, 1}, to be consistent with the rest of the notation.

Remark 2. More generally, the process of training a neural net with noisy SGD is P-complete in the following sense. Let A be a polynomial-time algorithm that receives a binary string as input and then returns a value in {0, 1}. For every T polynomial in n, there exists a neural net (Gn, f), a learning rate, and an inverse-polynomial level of noise such that when this net is trained for T time steps on (X0, Y0), ..., (X_{T−1}, Y_{T−1}) using noisy SGD and then run on XT, it returns A(X0, Y0, X1, Y1, ..., Y_{T−1}, XT) with high probability, for all possible (X0, Y0), ..., (X_{T−1}, Y_{T−1}), XT. The previous theorem is simply the case of this where A is an algorithm that learns a function from random samples.

Remark 3. While the learning algorithm used in Theorem 2 does not put a bound on how large the edge weights can get during the learning process, we can enforce one in such a way that there is a constant that the weights never exceed.

Recall that in SQ algorithms [Kea98], for a query φ, the oracle returns E_X φ(X, F(X)) within an error of τ, where the range of φ is typically normalized to 1. Note that the expectation is over the population distribution, so SGD is not an SQ algorithm in that sense. In order for SQ to learn a function class, [BFJ+94] shows that the number of queries times the signal-to-noise ratio 1/σ² has to overcome the statistical dimension of the class. Since the statistical dimension of parities is exponentially large, under polynomially small noise, SQ algorithms cannot learn parities with polynomially many queries.

Remark 4.
Theorem 2 shows in particular that parities can be learned efficiently by SGD on neural nets (see Example 1 in the Appendix for more details), even with an amount of noise that is polynomially small and that prevents SQ algorithms from learning parities. Thus, Theorem 2 shows a separation between SGD-based deep learning and SQ algorithms. We further discuss this phenomenon in the next section.

2.3 Negative results

2.3.1 GD and large averages

We saw that training neural nets with SGD and polynomial parameters is universal in that it can learn any efficiently learnable distribution. We now give a lower bound for learning with a family of descent algorithms which includes GD and SGD. This implies in particular that the universality is lost once perfect gradients are used, or once a large number of fresh samples are used to average each gradient, in agreement with the bounds from SQ algorithms [BFJ+94, FGV17, Kea98, BKW03, Fel16, Yan05, FGR+17, SVW15]. The theorem also gives a new quantity, the "junk-flow", which can be used to lower bound the performance of descent algorithms beyond the number of queries.

Definition 3 (Descent algorithms). Consider for each n > 0 a neural net of size |E(n)| initialized with weights W^(0). A descent algorithm running for T time steps is defined by a sequence of query functions {G_t}_{t∈[T]} that rely at each time step on m samples⁷, a query range⁸ of A, a parameter σ² for the noise variance, and operates by updating the weights at each iterate by

W^(t) = W^(t−1) − E_{X∼P̂_{S_m^(t)}} G_{t−1}(W^(t−1)(X), F(X)) + Z^(t), t = 1, ..., T, (2)

where {Z^(t)}_{t∈[T]} are i.i.d. N(0, σ²I_{|E(n)|}), {S_m^(t)}_{t∈[T]} are i.i.d. with S_m^(t) = (X_1^(t), ..., X_m^(t)) i.i.d. under PX (P̂_{S_m^(t)} denotes the empirical distribution of S_m^(t)), and {Z^(t)}_{t∈[T]} and {S_m^(t)}_{t∈[T]} are independent. We use the notation S(