# Efficient Deep Approximation of GMMs

Shirin Jalali, Carl Nuzman, Iraj Saniee
Bell Labs, Nokia, 600-700 Mountain Avenue, Murray Hill, NJ 07974
{shirin.jalali,carl.nuzman,iraj.saniee}@nokia-bell-labs.com

**Abstract.** The universal approximation theorem states that any regular function can be approximated closely using a single hidden layer neural network. Some recent work has shown that, for some special functions, the number of nodes in such an approximation can be reduced exponentially by using multi-layer neural networks. In this work, we extend this idea to a rich class of functions, namely the discriminant functions that arise in optimal Bayesian classification of Gaussian mixture models (GMMs) in $\mathbb{R}^n$. We show that such functions can be approximated with arbitrary precision using $O(n)$ nodes in a neural network with two hidden layers (a deep neural network), whereas a neural network with a single hidden layer (a shallow neural network) would require at least $O(\exp(n))$ nodes or exponentially large coefficients. Given the universality of the Gaussian distribution in the feature spaces of data, e.g., in speech, image and text, our results shed light on the observed efficiency of deep neural networks in practical classification problems.

## 1 Introduction

There is a rapidly growing literature which demonstrates the effectiveness of deep neural networks in classification problems that arise in practice, e.g., in audio, image or text classification. The universal approximation theorem (UAT), see [1, 2, 3, 4], states that any regular function, for example one that separates, in a (high-dimensional) feature space, a collection of points corresponding to images of dogs from those of cats, can be approximated by a neural network. But the UAT is proven for shallow, i.e., single hidden-layer, neural networks, and in fact the number of nodes needed may be exponentially or super-exponentially large in the ambient dimension of the feature space. Yet practical deep neural networks are able to solve such classification problems effectively and efficiently, i.e., using what amounts to a small number of nodes relative to the size of the feature space of the data. There is no theory yet as to why deep neural networks (DNNs from here on) are as effective and efficient in practice as they evidently are.

There are essentially two possible explanations for this observed outcome: 1) DNNs are always significantly more efficient than shallow networks, in terms of the number of nodes used to approximate any relevant function, or 2) DNNs are particularly suited to the discriminant functions that arise in practice, e.g., those that separate, in the feature space of images, points representing dogs from points representing cats. If the latter proposition is true, then the observed efficiency of DNNs is essentially due to the special form of the discriminant functions encountered in practice, and not to the universal efficiency of DNNs that the former proposition would imply.

The first alternative is a general question about function approximation with neural networks as the collection of basis functions. As of today, there are no general results showing that DNNs (those with two or more hidden layers) require fundamentally fewer nodes than shallow neural networks (SNNs from here on, i.e., those with a single hidden layer) for the approximation of general functions.
In this paper, we focus on the second alternative and provide an answer in the affirmative: indeed, many discriminant functions that arise in practice are such that DNNs require significantly, i.e., logarithmically, fewer nodes for their approximation than SNNs. To formalize what may constitute discriminant functions that arise in practice, we focus on a versatile class of distributions often used to model real-life data, namely Gaussian mixture models (GMMs for short). GMMs have been shown to be good models for audio, speech, image and text processing over the past decades, e.g., see [5, 6, 7, 8, 9].

### 1.1 Background

The universal approximation theorem [1, 2, 3, 4] states that shallow neural networks (SNNs) can approximate regular functions to any required accuracy, albeit potentially with an exponentially large number of nodes. Can this number be reduced significantly, e.g., logarithmically, by deep neural networks? As indicated above, there is no such general result as of yet, and there is scant literature that even discusses this question. Some evidence exists that DNNs may in fact not be efficient in theory, see [10]. On the other hand, some special functions have been constructed for which DNNs achieve a significant, even logarithmic, reduction in the number of nodes compared to SNNs; see, e.g., [11, 12, 13] for a special radial function, for functions of the form $f(\mathbf{x}, \mathbf{x}') = g(\langle \mathbf{x}, \mathbf{x}' \rangle)$ (with $f : S^{n-1} \times S^{n-1} \to \mathbb{R}$ and $g : [-1, 1] \to \mathbb{R}$), and for polynomials, respectively. However, the functions considered in these references are typically very special and have little demonstrated basis in practice. Perhaps the most illustrative cases are the high-degree polynomials discussed in [13], but the logarithmic reduction in the number of nodes due to the depth of the DNN demonstrated in that work occurs only for polynomials whose degree is very high relative to the feature dimension.

In this work we are motivated by model universality considerations. What models of data are typical, and what resulting discriminant functions do we typically need to approximate in practice? With a plausible model, we can determine whether the resulting discriminant function(s) can be approximated efficiently by deep networks. To this end, we focus on data with Gaussian feature distributions, which provide a practical model for many types of data, especially when the feature space is sufficiently concentrated, e.g., after a number of projections to lower-dimensional spaces; see, e.g., [14].

Our overall framework is based on the following set of definitions and demonstrations, described in detail in the following sections. Section 1.2 defines an L-layer neural network. Section 1.3 reviews the problem of optimal classification of a collection of high-dimensional GMMs. The classifier function for GMMs is readily seen to be the maximum of multiple discriminant functions consisting of sums of exponentials of quadratic functions in dimension $n$. Section 2 establishes the connection between approximating the defined discriminant functions and the described classification problem. Section 3 demonstrates that DNNs can approximate general $n$-dimensional GMM discriminant functions using $O(n)$ nodes. In Section 4, we show that, in contrast to DNNs, SNNs need either an exponential (in $n$) number of nodes, or exponentially large coefficients, to approximate discriminant functions of GMMs.
In Section 5, we show the sufficiency of an exponential number of nodes by studying an SNN whose first-layer weights are drawn from a random distribution.

**Notation.** Throughout the paper, bold letters, such as $\mathbf{x}$ and $\mathbf{y}$, refer to vectors. Sets are denoted by calligraphic letters, such as $\mathcal{X}$ and $\mathcal{Y}$. For a discrete set $\mathcal{X}$, $|\mathcal{X}|$ denotes its cardinality. $\mathbf{0}_n$ denotes the all-zero vector in $\mathbb{R}^n$ and $I_n$ denotes the $n$-dimensional identity matrix. For $\mathbf{x} \in \mathbb{R}^n$, $\|\mathbf{x}\|^2 = \sum_{i=1}^n x_i^2$.

### 1.2 L-layer neural networks and the activation function σ

**L-layer neural network.** Consider a fully-connected neural network with $L$ hidden layers. We refer to a network with $L = 1$ hidden layer as an SNN and to a network with $L > 1$ hidden layers as a DNN. Let $\mathbf{x} \in \mathbb{R}^n$ denote the input vector. The function generated by an $L$-layer neural network, $f : \mathbb{R}^n \to \mathbb{R}^c$, can be represented as a composition of affine functions and the non-linear function $\sigma$, as follows:
$$f(\mathbf{x}) = \sigma \circ T^{[L+1]} \circ \sigma \circ T^{[L]} \circ \cdots \circ \sigma \circ T^{[1]}(\mathbf{x}).$$
Here, $T^{[\ell]} : \mathbb{R}^{n_{\ell-1}} \to \mathbb{R}^{n_\ell}$ denotes the affine mapping applied at layer $\ell$, represented by a linear transformation $W^{[\ell]} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ and a translation $\mathbf{b}^{[\ell]}$. Moreover, $\sigma : \mathbb{R} \to \mathbb{R}$ denotes the non-linear function, applied element-wise. In this definition, $n_\ell$, $\ell = 1, \ldots, L$, denotes the number of hidden nodes in layer $\ell$. To make the notation consistent, for $\ell = 0$ let $n_0 = n$, the dimension of the input data, and for $\ell = L + 1$ let $n_{L+1} = c$, the number of classes. We will occasionally use the notations dnn and snn to signify the cases $L > 1$ and $L = 1$, respectively. In a classification task, the index of the highest value in the output tuple determines the predicted class for $\mathbf{x}$.

**Non-linear activation function σ.** As discussed later, for the two-layer construction in Section 3 we require some regularity assumptions on the activation function $\sigma$, which are met by typical smooth DNN activation functions such as the sigmoid function. In Section 7, we indicate how to refine the proofs to accommodate the popular and simple ReLU activation function. The proof of the inefficiency of SNNs in Section 4 applies to a very general class of activation functions.

### 1.3 GMMs and their optimal classification functions

Consider the problem of classifying points generated by a mixture of Gaussian distributions. Assume that there are $c$ classes and that samples of each class are drawn from a mixture of Gaussian distributions, with $k$ different Gaussian distributions overall. For $j \in \{1, \ldots, k\}$, let $\boldsymbol{\mu}_j \in \mathbb{R}^n$ and $\Sigma_j \in \mathbb{R}^{n \times n}$ denote the mean and the covariance matrix of Gaussian distribution $j$. Each Gaussian distribution is assigned uniquely to one of the $c$ classes. Assume that the assignment of the Gaussian clouds to the classes is represented by sets $\mathcal{T}_1, \ldots, \mathcal{T}_c$, which form a partition of $\{1, \ldots, k\}$. (That is, $\mathcal{T}_i \cap \mathcal{T}_j = \emptyset$ for $i \neq j$, and $\cup_{i=1}^c \mathcal{T}_i = \{1, \ldots, k\}$.) Set $\mathcal{T}_i$ represents the indices of the Gaussian distributions corresponding to Class $i$. Let $\phi_i$, $i = 1, \ldots, c$, denote the probability that a data point belongs to Class $i$. Finally, for Class $i$, let $w_j$, $j \in \mathcal{T}_i$, denote the probability that, within Class $i$, the data comes from Gaussian distribution $j$. Hence, for $i = 1, \ldots, c$, $\sum_{j \in \mathcal{T}_i} w_j = 1$. Under the described model, and with a slight abuse of notation, the data is distributed as $\sum_{i=1}^c \phi_i \sum_{j \in \mathcal{T}_i} w_j\, \mathcal{N}(\boldsymbol{\mu}_j, \Sigma_j)$. Conditioned on being in Class $i$, the data points are drawn from a mixture of $|\mathcal{T}_i|$ Gaussian distributions as $\sum_{j \in \mathcal{T}_i} w_j\, \mathcal{N}(\boldsymbol{\mu}_j, \Sigma_j)$.
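To make the setup concrete, here is a short numpy sketch (ours, for illustration only) that draws labeled samples from the class-conditional mixture just described; the number of classes and components, the partition $\mathcal{T}_i$, and all Gaussian parameters below are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem: n = 2 features, c = 2 classes, k = 3 Gaussian components.
n, c, k = 2, 2, 3
T = [[0], [1, 2]]                       # assumed partition of components into classes
phi = np.array([0.5, 0.5])              # class priors phi_i
w = np.array([1.0, 0.6, 0.4])           # within-class weights w_j (sum to 1 over each T_i)
mu = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
Sigma = np.stack([np.eye(n), 0.5 * np.eye(n), 2.0 * np.eye(n)])

def sample(m):
    """Draw m labeled points from sum_i phi_i sum_{j in T_i} w_j N(mu_j, Sigma_j)."""
    labels = rng.choice(c, size=m, p=phi)                   # pick a class
    X = np.empty((m, n))
    for t, i in enumerate(labels):
        j = rng.choice(T[i], p=w[T[i]] / w[T[i]].sum())     # pick a component within the class
        X[t] = rng.multivariate_normal(mu[j], Sigma[j])
    return X, labels

X, y = sample(1000)
print(X.shape, np.bincount(y))
```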
For $j = 1, \ldots, k$, let $\pi_j : \mathbb{R}^n \to \mathbb{R}$ denote the probability density function (pdf) of the Gaussian distribution with mean $\boldsymbol{\mu}_j$ and covariance matrix $\Sigma_j$. An optimal Bayesian classifier $\mathcal{C}^* : \mathbb{R}^n \to \{1, \ldots, c\}$ for these $c$ GMMs maximizes the probability of membership across all classes. (Throughout the paper, a Bayesian classifier refers to a classifier that has access to the distribution of the data.) For Class $i$, define the $i$-th discriminant function $d_i : \mathbb{R}^n \to \mathbb{R}$ as
$$d_i(\mathbf{x}) \triangleq \phi_i \sum_{j \in \mathcal{T}_i} w_j \gamma_j \exp(-g_j(\mathbf{x})), \tag{1}$$
where $g_j(\mathbf{x}) \triangleq \tfrac12 (\mathbf{x} - \boldsymbol{\mu}_j)^T \Sigma_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j)$ and $\gamma_j \triangleq (2\pi)^{-n/2} |\Sigma_j|^{-1/2}$. Using this definition, the optimal classifier $\mathcal{C}^*$ can be characterized as
$$\mathcal{C}^*(\mathbf{x}) = \arg\max_{i \in \{1, \ldots, c\}} d_i(\mathbf{x}). \tag{2}$$

## 2 Connection between classification and approximation

The main result of this paper is that the discriminant functions described in (1), required for computing the optimal classification function $\mathcal{C}^*(\mathbf{x})$, can be approximated accurately by a relatively small neural network with two hidden layers, whereas accurate approximation with a single hidden layer network is only possible if either the number of nodes or the magnitudes of the coefficients are exponentially large in $n$.

Before stating our main results, in this section we establish a connection between the accuracy in approximating the discriminant functions of a classifier and the error performance of a classifier that employs these approximations. Given a non-negative function $d : \mathbb{R}^n \to \mathbb{R}$ and a threshold $t > 0$, let $S_{d,t}$ denote the superlevel set of $d(\mathbf{x})$, defined as $S_{d,t} \triangleq \{\mathbf{x} \in \mathbb{R}^n : d(\mathbf{x}) \ge t\}$.

**Definition 1** A function $\hat{d} : \mathbb{R}^n \to \mathbb{R}$ is a $(\delta, q)$-approximation of a non-negative function $d : \mathbb{R}^n \to \mathbb{R}$ under a pdf $p$, if there is a threshold $t$ such that $P_p[S_{d,t}] \ge 1 - q$, and
$$|\hat{d}(\mathbf{x}) - d(\mathbf{x})| \le \delta\, d(\mathbf{x}), \quad \mathbf{x} \in S_{d,t}, \tag{3}$$
$$0 \le \hat{d}(\mathbf{x}) \le (1 + \delta)t, \quad \mathbf{x} \notin S_{d,t}. \tag{4}$$
Let $t_{\hat{d},\delta,q}$ denote the corresponding threshold. If there are multiple such thresholds, let $t_{\hat{d},\delta,q}$ denote the infimum of all such thresholds.

In this definition, $\hat{d}$ closely approximates $d$ in a relative sense wherever $d(\mathbf{x})$ exceeds the threshold $t$, and $\hat{d}$ is small (in an absolute sense) where $d(\mathbf{x})$ is small, an event that occurs with low probability under $p$. Although $p$ and $d$ need not be related in this definition, we will typically use it in cases where $d$ is just a scaled version of $p$.

Given two equiprobable classes with pdfs $p_1$ and $p_2$, the optimal Bayesian classifier chooses class 1 if $p_1(\mathbf{x}) > p_2(\mathbf{x})$, and class 2 otherwise. Let $e_{21,\mathrm{opt}} = P_1[p_2(\mathbf{x}) > p_1(\mathbf{x})]$ denote the probability of incorrectly deciding class 2 when the true distribution is class 1. If we classify using approximate pdfs with relative errors bounded by $\alpha \ge 1$, then the probability of error increases to $e_{21,\mathrm{opt}}[\alpha] := P_1[p_2(\mathbf{x}) > p_1(\mathbf{x})/\alpha]$. Under appropriate regularity conditions, $e_{21,\mathrm{opt}}[\alpha]$ approaches $e_{21,\mathrm{opt}}$ as $\alpha$ converges to 1. Lemma 1 below shows that $(\delta, q)$-approximations of $p_1$ and $p_2$ enable us to approach $e_{21,\mathrm{opt}}$ by taking $\delta$ and $q$ sufficiently small.

**Lemma 1** Given pdfs $p_1$ and $p_2$, let $\hat{d}_1$ and $\hat{d}_2$ denote $(\delta, q)$-approximations of the discriminant functions $d_1 = p_1$ and $d_2 = p_2$ under distributions $p_1$ and $p_2$, respectively. Define $t_i \triangleq t_{\hat{d}_i,\delta,q}$, $i = 1, 2$. Consider a classifier that declares class 1 when $\hat{d}_1(\mathbf{x}) > \hat{d}_2(\mathbf{x})$ and class 2 otherwise. Then the probability of error of this classifier, under distribution $p_1$, is bounded as
$$e_{21} \le e_{21,\mathrm{opt}}\!\left[\tfrac{1+\delta}{1-\delta}\right] + q + P_1\!\left(S^c_{d_1,\,(1+\delta)t_2/(1-\delta)}\right),$$
where $P_1(E)$ measures the probability of event $E$ under $p_1$.

The proof of Lemma 1 is presented in Section 1 of the supplementary material (SM). Note that as $q$ converges to zero, both $t_1$ and $t_2$ converge to zero as well. Therefore, letting $q$ converge to zero ensures that $P_1\big((1+\delta)t_2 \ge (1-\delta)d_1(\mathbf{x})\big)$ also converges to zero.
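Before turning to approximations, it helps to have the exact rule in hand. The sketch below evaluates the discriminant functions (1) and the classifier (2) directly for the same toy mixture used above (parameters repeated so the snippet stands alone); it is a reference implementation of the exact Bayes rule, not of the network approximations studied in the following sections.

```python
import numpy as np

# Same illustrative toy mixture as in the sampling sketch above.
n, c = 2, 2
T = [[0], [1, 2]]
phi = np.array([0.5, 0.5])
w = np.array([1.0, 0.6, 0.4])
mu = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
Sigma = np.stack([np.eye(n), 0.5 * np.eye(n), 2.0 * np.eye(n)])

def discriminants(x):
    """Exact d_i(x) = phi_i * sum_{j in T_i} w_j * gamma_j * exp(-g_j(x)), cf. (1)."""
    d = np.zeros(c)
    for i in range(c):
        for j in T[i]:
            diff = x - mu[j]
            g = 0.5 * diff @ np.linalg.solve(Sigma[j], diff)                    # g_j(x)
            gamma = (2 * np.pi) ** (-n / 2) / np.sqrt(np.linalg.det(Sigma[j]))  # gamma_j
            d[i] += phi[i] * w[j] * gamma * np.exp(-g)
    return d

def bayes_classifier(x):
    """Optimal classifier (2): index of the largest discriminant."""
    return int(np.argmax(discriminants(x)))

print(bayes_classifier(np.array([0.2, -0.1])))   # near mu_0, so class 0
print(bayes_classifier(np.array([2.8, 0.3])))    # near mu_1, so class 1
```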
One way to construct a nearly optimal classifier for two distributions is thus to independently build a $(\delta, q)$-approximation for each distribution, and then define the classifier based on the maximum of the two functions. With this motivation, in the rest of the paper we focus on approximating the discriminant functions defined earlier for classifying GMMs. In the next section, we show that, using a neural network with two hidden layers, we can construct a $(\delta, q)$-approximation $\hat{d}$ of the discriminant function of a GMM with input dimension $n$, see (1), with $O(n)$ nodes, for any $\delta > 0$ and any $q > 0$. In the subsequent section, we show by contrast that, even for the simplest GMM consisting of a single Gaussian distribution, even a weaker approximation that bounds the expected $\ell_2$ error cannot be achieved by a single hidden-layer network, unless the network has either exponentially many nodes or exponentially large coefficients. The weaker notion of approximation that we will use in the converse result is the following.

**Definition 2** A function $\hat{d} : \mathbb{R}^n \to \mathbb{R}$ is an $\epsilon$-relative $\ell_2$ approximation of a function $d : \mathbb{R}^n \to \mathbb{R}$ under pdf $p$, if
$$\mathbb{E}_p[(\hat{d}(\mathbf{x}) - d(\mathbf{x}))^2] \le \epsilon\, \mathbb{E}_p[(d(\mathbf{x}))^2].$$

The following lemma shows that if approximation under this weaker notion is impossible, then it is also impossible under the stronger $(\delta, q)$ notion.

**Lemma 2** If $\hat{d}$ is a $(\delta, q)$-approximation of a function $d$ under distribution $p$, then it is also an $\epsilon$-relative $\ell_2$ approximation of $d$, with parameter $\epsilon = \delta^2 + (1+\delta)^2 q$.

The proof of Lemma 2 is presented in Section 2 of the SM.

## 3 Sufficiency of a two hidden-layer NN with O(n) nodes

We are interested in approximating the discriminant functions corresponding to optimal classification of GMMs, as defined in Section 1.3. In this section, we consider a generic function of the form (1),
$$d(\mathbf{x}) = \sum_{j=1}^{J} \beta_j \exp(-g_j(\mathbf{x})), \tag{5}$$
where $g_j(\mathbf{x}) = \tfrac12 (\mathbf{x} - \boldsymbol{\mu}_j)^T \Sigma_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j)$ and $\beta_j = \phi\, w_j \gamma_j$, with fixed prior $\phi$, conditional probabilities $w_j$, and $\gamma_j = (2\pi)^{-n/2} |\Sigma_j|^{-1/2}$.

We first observe that the function $g_j(\mathbf{x})$ is a general quadratic form in $\mathbb{R}^n$ and thus consists of the sum of $O(n^2)$ product terms of the form $x_i x_j$. Since each such product term can be approximated via four $\sigma$ functions (see [15]), $g_j(\mathbf{x})$ can be approximated arbitrarily well using $O(n^2)$ nodes. We can, however, reduce the number of nodes further by applying the affine transformation $\mathbf{y}_j = \Sigma_j^{-1/2}(\mathbf{x} - \boldsymbol{\mu}_j)$ in the first layer, so that $g_j(\mathbf{x})$ is simply $\tfrac12 \|\mathbf{y}_j\|^2 = \tfrac12 \sum_{i=1}^n y_{j,i}^2$, i.e., it consists of $n$ quadratic terms and can in turn be approximated via $O(n)$ nodes. This $O(n^2)$-to-$O(n)$ reduction in the number of nodes is specific to quadratic polynomials, which are the generic exponents of GMMs and their discriminant functions, and we take advantage of this reduction in our proofs.

In order to prove the main result of this section, we rely on the following regularity assumptions on the activation function $\sigma(x)$. All of the assumptions are satisfied, for example, by the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$.

**Assumption 1 (Curvature)** There is a point $\tau \in \mathbb{R}$ and parameters $r > 0$ and $M$ such that $\sigma^{(2)}(\tau) > 0$, and $\sigma^{(3)}(x)$ exists and is bounded, $|\sigma^{(3)}(x)| \le M$, in the neighborhood $\tau - r \le x \le \tau + r$.

**Assumption 2 (Monotonicity)** The symmetric function $\sigma(x + \tau) + \sigma(-x + \tau)$ is monotonically increasing for $x \ge 0$, with $\tau$ as defined in Assumption 1.

**Assumption 3 (Exponential Decay)** There is $\eta > 0$ such that $|\sigma(x)| \le \exp(\eta x)$ and $|1 - \sigma(x)| \le \exp(-\eta x)$.
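As a quick numerical sanity check (ours, not part of the paper), the following snippet verifies Assumptions 1–3 for the sigmoid, using the illustrative choices $\tau = -1$ and $\eta = 1$; any $\tau < 0$ would serve for Assumption 1.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
tau, eta = -1.0, 1.0

# Assumption 1 (curvature): sigma''(tau) > 0 (central second difference).
h = 1e-4
d2 = (sigmoid(tau + h) - 2 * sigmoid(tau) + sigmoid(tau - h)) / h**2
assert d2 > 0

# Assumption 2 (monotonicity): x -> sigma(x + tau) + sigma(-x + tau) nondecreasing on x >= 0.
x = np.linspace(0.0, 20.0, 2001)
s = sigmoid(x + tau) + sigmoid(-x + tau)
assert np.all(np.diff(s) >= -1e-12)

# Assumption 3 (exponential decay): |sigma(x)| <= exp(eta*x) and |1 - sigma(x)| <= exp(-eta*x).
x = np.linspace(-30.0, 30.0, 6001)
assert np.all(sigmoid(x) <= np.exp(eta * x) + 1e-12)
assert np.all(1.0 - sigmoid(x) <= np.exp(-eta * x) + 1e-12)

print("sigmoid satisfies Assumptions 1-3 numerically for tau = -1, eta = 1")
```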
Assumptions 1 and 2 can be used to construct an approximation of $x^2$ using $O(1)$ nodes for each such term. They are satisfied, for example, by common activation functions such as the sigmoid and tanh functions. Assumption 3 is relied upon to construct an approximation of $\exp(x)$ in the second hidden layer, with $O(n)$ nodes in each of the $J$ subnetworks. This assumption is met by the indicator function $u(x) = 1\{x > 0\}$, and by any number of activation functions that are smoother versions of $u(x)$, including piecewise linear approximations of $u(x)$ constructed with ReLU, and the sigmoid function.

The following is our main positive result about the ability to efficiently approximate GMM discriminant functions with two-layer neural networks.

**Theorem 1** Consider a GMM with discriminant function $d : \mathbb{R}^n \to \mathbb{R}_+$ of the form (5), consisting of Gaussian pdfs with bounded covariance matrices. Let the activation function $\sigma : \mathbb{R} \to \mathbb{R}$ satisfy Assumptions 1, 2, and 3. Then for any given $\delta > 0$ and any $q \in (0, 1)$, there exists a two-hidden-layer neural network consisting of $M = O(n)$ instances of the activation function $\sigma$ and weights growing as $O(n^5)$, such that its output function $\hat{d}$ is a $(\delta, q)$-approximation of $d$ under the distribution of the GMM.

The detailed proof of Theorem 1 is presented in Section 5 of the SM. The proof relies on several lemmas that are stated and proved in Section 4 of the SM.

**Remark 1** Applying Theorem 1 to a collection of $c$ GMMs gives rise to a DNN with $O(n)$ nodes that approximates the optimal classifier of these GMMs via (2).

**Remark 2** The construction of an $O(n)$-node approximation of the GMM discriminant function assumes that the eigenvalues of the covariance matrices are bounded from above and also bounded away from zero, by constants independent of $n$.

To prove Theorem 1, given $d(\mathbf{x}) = \sum_{j=1}^J \beta_j \exp(-g_j(\mathbf{x}))$, we build a neural net consisting of $J$ sub-networks, with sub-network $j$ approximating $\beta_j \exp(-g_j(\mathbf{x}))$. For convenience, denote by $c_j(\mathbf{x}) = \beta_j \exp(-g_j(\mathbf{x}))$ the desired output of the $j$-th subnetwork. The $J$ subnetwork function approximations $\hat{c}_j(\mathbf{x})$, $j = 1, \ldots, J$, are summed to obtain the final output $\hat{d}(\mathbf{x}) = \sum_{j=1}^J \hat{c}_j(\mathbf{x})$.

**Lemma 3** Given $\delta > 0$, $q > 0$, and the GMM discriminant function $d(\mathbf{x}) = \sum_{j=1}^J c_j(\mathbf{x})$, let $t^*$ be such that $P[S_{d,t^*}] \ge 1 - q$ under the pdf $p(\mathbf{x}) = d(\mathbf{x})/\phi$. Define $\lambda = t^*\delta/(2J(1 + \delta))$, and for each $j$ suppose we have an approximation function $\hat{c}_j$ of $c_j$ such that
$$|\hat{c}_j(\mathbf{x}) - c_j(\mathbf{x})| \le \tfrac{\delta}{2} c_j(\mathbf{x}), \quad \text{if } c_j(\mathbf{x}) \ge \lambda,$$
$$0 \le \hat{c}_j(\mathbf{x}) \le \lambda(1 + \delta), \quad \text{otherwise.}$$
Then $\hat{d}(\mathbf{x}) = \sum_j \hat{c}_j(\mathbf{x})$ is a $(\delta, q)$-approximation of $d(\mathbf{x})$ under $p(\cdot)$.

The proof is presented in Section 3 of the SM. This lemma establishes a sufficient standard of accuracy for the subnetwork associated with each Gaussian component. In particular, there is a level $\lambda$ such that we need relative error better than $\delta/2$ when the component function is greater than $\lambda$; where the component function is smaller than $\lambda$, we require only an upper bound on the approximation. The critical level $\lambda$ is proportional to $t^*$, which is a level achieved with high probability by the overall discriminant function $d$. The scaling of the level $t^*$ with $n$ is an important part of the proof, analyzed in Lemma 5 of the SM.
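To give a flavor of how Assumption 1 is used in the first hidden layer, the sketch below approximates a single square term $y^2$ by a scaled second difference of the sigmoid around $\tau$, the standard trick from [15]; the values of $\tau$ and the small scale are illustrative, and the snippet is only meant to convey the idea behind the $O(1)$-nodes-per-square claim, not to reproduce the exact construction in the SM.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def approx_square(y, tau=-1.0, scale=0.05):
    """Approximate y**2 using two input-dependent sigmoid nodes plus a constant:
    sigma(tau + scale*y) + sigma(tau - scale*y) - 2*sigma(tau)
      ~= scale**2 * sigma''(tau) * y**2   (Taylor expansion around tau),
    then divide by scale**2 * sigma''(tau). Accuracy improves as scale shrinks."""
    h = 1e-4
    d2 = (sigmoid(tau + h) - 2 * sigmoid(tau) + sigmoid(tau - h)) / h**2   # sigma''(tau)
    num = sigmoid(tau + scale * y) + sigmoid(tau - scale * y) - 2 * sigmoid(tau)
    return num / (scale**2 * d2)

y = np.linspace(-3.0, 3.0, 7)
print(np.round(approx_square(y), 3))                   # close to [9, 4, 1, 0, 1, 4, 9]
print(float(np.max(np.abs(approx_square(y) - y**2))))  # small error, shrinking with the scale
```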
## 4 Exponential size of SNNs for approximating the GMM discriminant functions

In the previous section, we showed that a DNN with two hidden layers and $O(n)$ hidden nodes is able to approximate the discriminant functions corresponding to an optimal Bayesian classifier for a collection of GMMs. In this section, we prove a converse result for SNNs. More precisely, we prove that for an SNN to approximate the discriminant function of even a single Gaussian distribution, the number of nodes needs to grow exponentially with $n$.

Consider a neural network with a single hidden layer consisting of $n_1$ nodes. As before, let $\sigma : \mathbb{R} \to \mathbb{R}$ denote the non-linear function applied by each hidden node. For $i = 1, \ldots, n_1$, let $\mathbf{w}_i \in \mathbb{R}^n$ and $b_i \in \mathbb{R}$ denote the weight vector and the bias corresponding to node $i$, respectively. The function generated by this network can be written as
$$f(\mathbf{x}) = \sum_{i=1}^{n_1} a_i \sigma(\langle \mathbf{w}_i, \mathbf{x} \rangle + b_i) + a_0. \tag{6}$$
Suppose that $\mathbf{x} \in \mathbb{R}^n$ is distributed as $\mathcal{N}(\mathbf{0}_n, s_x I_n)$, with pdf $\mu : \mathbb{R}^n \to \mathbb{R}$. Suppose that the function to be approximated is
$$\mu_c(\mathbf{x}) \triangleq \left(\frac{s_f + 2s_x}{s_f}\right)^{n/4} e^{-\frac{1}{2s_f}\|\mathbf{x}\|^2}, \tag{7}$$
which has the form of a symmetric zero-mean Gaussian distribution with variance $s_f$ in each direction, normalized so that $\mathbb{E}[\mu_c^2(\mathbf{x})] = 1$. Our goal is to show that unless the number of nodes $n_1$ is exponentially large in the input dimension $n$, the network cannot approximate the function $\mu_c$ defined in (7) in the sense of Definition 2.

Our result applies to very general activation functions, and allows a different activation function in every node; essentially, all we require is that i) the response of each activation function depends on the input $\mathbf{x}$ through a scalar product $\langle \mathbf{w}_i, \mathbf{x} \rangle$, and ii) the output of each hidden node is square-integrable with respect to the Gaussian distribution $\mu(\mathbf{x})$. Incorporating the constant term into one of the activation functions, we consider the more general model
$$f(\mathbf{x}) = \sum_{i=1}^{n_1} a_i h_i(\langle \mathbf{w}_i, \mathbf{x} \rangle) \tag{8}$$
for a set of functions $h_i : \mathbb{R} \to \mathbb{R}$. To avoid scale ambiguities in the definition of the coefficients $a_i$ and the functions $h_i$, we scale $h_i$ as necessary so that, for $i = 1, \ldots, n_1$, $\|\mathbf{w}_i\| = 1$ and $\mathbb{E}[(h_i(\langle \mathbf{w}_i, \mathbf{x} \rangle))^2] = 1$.

Our main result in this section connects the number of nodes $n_1$ to the achievable approximation error $\mathbb{E}[|\mu_c(\mathbf{x}) - f(\mathbf{x})|^2]$. We focus on approximation in the sense of Definition 2, which, as shown in Lemma 2, is weaker than the notion used in Theorem 1. Therefore, proving the lower bound under this notion automatically proves the same bound under the stronger notion as well.

**Theorem 2** Consider $\mu_c : \mathbb{R}^n \to \mathbb{R}$ and $f : \mathbb{R}^n \to \mathbb{R}$ defined in (7) and (8), respectively, for some $s_f > 0$. Suppose that the random vector $\mathbf{x} \sim \mathcal{N}(\mathbf{0}_n, s_x I_n)$, where $s_x > 0$. For $i = 1, \ldots, n_1$, assume that $\|\mathbf{w}_i\| = 1$ and $\mathbb{E}[(h_i(\langle \mathbf{w}_i, \mathbf{x} \rangle))^2] = 1$ for activation function $h_i : \mathbb{R} \to \mathbb{R}$. Then,
$$\mathbb{E}[|\mu_c(\mathbf{x}) - f(\mathbf{x})|^2] \ge 1 - 2\sqrt{n_1}\, \|\mathbf{a}\|\, (1 + s_x/s_f)^{1/4}\, \rho^{-n/4}, \tag{9}$$
$$\rho \triangleq 1 + \frac{s_x^2}{s_f^2 + 2 s_x s_f} > 1. \tag{10}$$

The proof of Theorem 2 is presented in Section 6 of the SM. This result shows that if we want to form an $\epsilon$-relative $\ell_2$ approximation of $\mu_c$, in the sense of Definition 2, with an SNN, then $n_1$ must satisfy
$$n_1 \ge \frac{1 - \epsilon}{2A(1 + s_x/s_f)^{1/4}}\, \rho^{n/4},$$
where $A = \|\mathbf{a}\|/\sqrt{n_1}$ denotes the root-mean-squared value of $\mathbf{a}$. That is, the number of nodes needs to grow exponentially with $n$, unless the norm of the final-layer coefficient vector $\mathbf{a}$ grows exponentially in $n$ as well. Note that in the natural case $s_f = s_x$, where the discriminant function to be approximated matches the distribution of the input data, the required exponential rate of growth is $\rho^{n/4} = (4/3)^{n/4}$.
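As a numerical illustration of the bound in (9)–(10), the sketch below computes the minimum width implied by Theorem 2 for a few input dimensions, with the RMS coefficient level $A$, the target error $\epsilon$, and $s_f = s_x$ fixed at illustrative values.

```python
import numpy as np

def min_width_lower_bound(n, eps=0.1, A=1.0, sx=1.0, sf=1.0):
    """Minimum n1 implied by Theorem 2:
    n1 >= (1 - eps) / (2 * A * (1 + sx/sf)**0.25) * rho**(n/4),
    with rho = 1 + sx**2 / (sf**2 + 2*sx*sf)  (= 4/3 when sf = sx)."""
    rho = 1.0 + sx**2 / (sf**2 + 2.0 * sx * sf)
    return (1.0 - eps) / (2.0 * A * (1.0 + sx / sf) ** 0.25) * rho ** (n / 4.0)

for n in [10, 100, 500, 1000]:
    print(n, f"{min_width_lower_bound(n):.3e}")
# With sf = sx the required width grows like (4/3)**(n/4):
# more than 1e30 hidden nodes are needed at n = 1000 unless ||a|| grows comparably.
```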
**Remark 3** The generalized model (8) covers a large class of activation functions. It is straightforward to confirm that the required conditions are satisfied by bounded activation functions, such as the sigmoid or tanh functions, with arbitrary bias values. For the popular ReLU function, $h_i(\langle \mathbf{w}_i, \mathbf{x} \rangle) = \max(\langle \mathbf{w}_i, \mathbf{x} \rangle + b_i, 0)$, and therefore $\mathbb{E}[|h_i(\langle \mathbf{w}_i, \mathbf{x} \rangle)|^2] \le \mathbb{E}[(\langle \mathbf{w}_i, \mathbf{x} \rangle + b_i)^2] = s_x + b_i^2$, which again confirms the desired square-integrability property.

**Remark 4** From the point of view of numerical stability, it is natural to require the norm of the final-layer coefficients, $\|\mathbf{a}\|$, to be bounded, as the following simple argument shows. Suppose that the network implementation can compute each activation function $h_i(x)$ exactly, but that it represents each coefficient $a_i$ in a floating-point format with finite precision. To gain intuition on the effect of this quantization noise, consider the following model: the implementation replaces $a_i$ with $a_i + z_i$, where $\mathbb{E}[z_i] = 0$ and $\mathbb{E}[z_i^2] = \nu |a_i|^2$. Further assume that $z_1, \ldots, z_{n_1}$ are independent of each other and of $\mathbf{x}$. In this model, $\nu$ reflects the level of precision in the representation. Then the error due to quantization can be written as
$$\mathbb{E}\Big[\Big(\sum_i z_i h_i(\langle \mathbf{w}_i, \mathbf{x} \rangle)\Big)^2\Big] = \sum_i \nu |a_i|^2\, \mathbb{E}\big[(h_i(\langle \mathbf{w}_i, \mathbf{x} \rangle))^2\big] = \nu \|\mathbf{a}\|^2. \tag{11}$$
In such an implementation, in order to keep the quantization error significantly below the targeted overall error $\epsilon$, we need to have $\|\mathbf{a}\| \le \sqrt{\epsilon/\nu}$. Unless the magnitudes of the weights used in the output layer are bounded in this way, accurate computation is not achievable in practice.

## 5 Sufficiency of exponentially many nodes

In Section 4, we studied the ability of an SNN to approximate the function $\mu_c$ defined in (7) and showed that such a network, if the weights are not allowed to grow exponentially with $n$, requires exponentially many nodes to make the error small. Theorem 2 is a converse result, which implies that the number of nodes $n_1$ should grow with $n$ at least as $\rho^{n/4}$ ($\rho > 1$). The next natural question is the following: would exponentially many nodes actually suffice to approximate the function $\mu_c$? In this section, we answer this question affirmatively and show a simple construction with random weights that, given enough nodes, approximates the function $\mu_c$ defined in (7) to within the desired accuracy.

Recall that $\mu_c(\mathbf{x}) = \alpha^{n/4} \exp(-\frac{1}{2s_f}\|\mathbf{x}\|^2)$, where $\alpha \triangleq \frac{s_f + 2s_x}{s_f}$. Consider the output function of a single hidden-layer neural network with all biases set to zero. The function generated by such a network can be written as
$$f(\mathbf{x}) = \sum_{i=1}^{n_1} a_i \sigma(\langle \mathbf{w}_i, \mathbf{x} \rangle). \tag{12}$$
As before, $\sigma : \mathbb{R} \to \mathbb{R}$ denotes the non-linear function and $\mathbf{w}_i \in \mathbb{R}^n$, $\|\mathbf{w}_i\| = 1$, denotes the weight vector used by hidden node $i$. To show the sufficiency of exponentially many nodes, we consider the special non-linear function $\sigma(x) = \cos(x/\sqrt{s_f})$.

**Theorem 3** Consider the function $\mu_c : \mathbb{R}^n \to \mathbb{R}$ defined in (7), and the $n$-dimensional random vector $\mathbf{x} \sim \mathcal{N}(\mathbf{0}_n, s_x I_n)$. Consider the function $f : \mathbb{R}^n \to \mathbb{R}$ defined in (12). Let $\sigma(x) = \cos(x/\sqrt{s_f})$ and, for $i = 1, \ldots, n_1$, $a_i = \alpha^{n/4}/n_1$, where $\alpha = 1 + 2s_x/s_f$. Given $\epsilon > 0$, assume that $n_1 > \frac{1}{\epsilon}\, \alpha^{n/2}$. Then there exist weights $\mathbf{w}_1, \ldots, \mathbf{w}_{n_1}$ such that $\mathbb{E}_{\mathbf{x}}[(f(\mathbf{x}) - \mu_c(\mathbf{x}))^2] \le \epsilon$.

The proof of Theorem 3 is presented in Section 7 of the SM. To better understand the implications of Theorem 3 and how it compares with Theorem 2, define $m_1 = \rho = 1 + \frac{s_x^2}{s_f^2 + 2 s_x s_f}$ and $m_2 = \alpha^2 = \left(1 + \frac{2 s_x}{s_f}\right)^2$, where $\rho$ is defined in (10). It is straightforward to see that $1 < m_1 < m_2$ for all positive values of $(s_x, s_f)$. Theorems 2 and 3 show that there exist constants $c_1$ and $c_2$ such that if the number of hidden nodes $n_1$ in a single hidden-layer network is smaller than $c_1 m_1^{n/4}$, the expected error in approximating the function $\mu_c(\mathbf{x})$ must get arbitrarily close to one. On the other hand, if $n_1$ is larger than $c_2 m_2^{n/4}$, then there exists a set of weights such that the error can be made arbitrarily close to zero. In other words, it seems that there is a phase transition in the exponential growth rate of the number of nodes, below which the function cannot be approximated with a single hidden layer. Characterizing that phase transition is an interesting open question, which we leave to future work.
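The following sketch illustrates Theorem 3 in a small dimension. The theorem only asserts the existence of suitable weights; here, purely for illustration, we draw the directions i.i.d. from a standard Gaussian, a natural randomized choice in the spirit of this section's random-weight construction (not necessarily the exact one analyzed in the SM), for which the cosine features are unbiased estimates of $\mu_c$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sx, sf = 4, 1.0, 1.0                 # small illustrative dimension
alpha = (sf + 2.0 * sx) / sf

def mu_c(X):
    """Target function (7), normalized so that E[mu_c(x)^2] = 1 for x ~ N(0, sx*I_n)."""
    return alpha ** (n / 4) * np.exp(-np.sum(X**2, axis=1) / (2.0 * sf))

def random_cosine_net(X, n1):
    """f(x) = (alpha^{n/4}/n1) * sum_i cos(<w_i, x>/sqrt(sf)) with w_i ~ N(0, I_n).
    Since E_w[cos(<w, x>/sqrt(sf))] = exp(-||x||^2/(2*sf)), f is an unbiased Monte Carlo
    estimate of mu_c(x), and its variance shrinks like 1/n1."""
    W = rng.standard_normal((n1, n))
    return alpha ** (n / 4) / n1 * np.cos(X @ W.T / np.sqrt(sf)).sum(axis=1)

X = rng.normal(scale=np.sqrt(sx), size=(2000, n))     # test points x ~ N(0, sx*I_n)
for n1 in [10, 100, 1000, 5000]:
    err = np.mean((random_cosine_net(X, n1) - mu_c(X)) ** 2)
    print(n1, f"{err:.4f}")                           # empirical E[(f - mu_c)^2], roughly O(1/n1)
```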
## 6 Related work

There is a rich and well-developed literature on the complexity of Boolean circuits and the important role depth plays in them. However, since it is not clear to what extent such results on Boolean circuits have consequences for DNNs, we do not summarize this literature; the interested reader may wish to start with [16]. A key notion for us is that of depth, that is to say, the number of (hidden) layers of nodes in a neural network as defined in Section 1.2. We are interested in knowing to what extent, if any, depth reduces the complexity of the neural network needed to express or approximate functions of interest in classification. It is not the complexity of the function to be approximated that matters, because the UAT already tells us that regular functions, which include the discriminant functions we discuss in Section 1.3, can be approximated by SNNs, i.e., shallow neural networks. Rather, it is the complexity of the network, as measured by the number of nodes needed for the approximation, that is of interest to us. In this respect, the works [17, 18, 19] contain approximation results for neural structures for certain polynomials and tensor functions, in the spirit of what we are looking for; but, as with Boolean circuits, these models deviate substantially from the standard DNN models we consider here, namely those that represent the neural networks that have worked well in practice and for whose behavior we wish to obtain fundamental insights.

Remarkably, there is a small collection of recent results which, as in this paper, show that adding a single layer to an SNN reduces the number of nodes by a logarithmic factor for the approximation of some special functions: see [13, 11, 12, 20] for the approximation of high-degree polynomials, a certain radial function, special functions of the inner products of high-dimensional vectors, and saw-tooth functions, respectively. Our work is in the same spirit as these, showing the power of two hidden layers in reducing the complexity of DNNs; it is therefore a continuation and generalization of the said set of results, and is especially informed by [11]. For a specialized radial function in $\mathbb{R}^n$, [11] shows that while any SNN would require at least exponentially many nodes to approximate the function, there exists a DNN with two hidden layers and $O(n^{19/4})$ nodes that well approximates the same function. In the present work, for a general class of widely-used functions, viz., GMM discriminant functions, we show that while SNNs require at least exponentially many nodes, for any GMM discriminant function there exists a DNN with two hidden layers and only $O(n)$ nodes that approximates it.

## 7 Remarks and conclusion

It is worth noting that even though we used a variety of sufficient regularity assumptions on the non-linear function $\sigma$ to prove Theorem 1, these assumptions are not necessary to construct an efficient two-layer network. For example, to construct a network using the commonly used rectified linear unit (ReLU) activation, in the first layer we can form $n$ super-nodes, each of which has a piecewise linear response $h_i(x)$ that approximates $x^2$ with the accuracy specified in Lemma 1 of the SM. The number of basic nodes needed in each super-node in this construction is $2R/\sqrt{\nu}$, where $R$ and $\nu$ denote the range and the accuracy of the approximation of $x^2$ in layer one, respectively. The analysis of $R$ and $\nu$ in Lemmas 3 and 5 of the SM shows that $R$ is $O(\sqrt{n})$ and $\nu$ is $O(1/n)$, so that the number of nodes needed per super-node in the first layer is now $O(n)$, compared to $O(1)$ in the construction presented in Section 3. Since there are $n$ such super-nodes, the total number of basic nodes in the network becomes $O(n^2)$, still an exponential reduction compared with a single-layer network.
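To illustrate the ReLU super-node idea sketched above, the following snippet assembles a piecewise linear interpolant of $y^2$ on $[-R, R]$ from ReLU units; the range $R$, the accuracy $\nu$, and the knot spacing are our own illustrative choices rather than the values dictated by Lemmas 3 and 5 of the SM, but they yield the same $O(R/\sqrt{\nu})$ node count up to constants.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def square_supernode(R=4.0, nu=0.01):
    """Piecewise linear interpolant of y**2 on [-R, R] built from ReLU units.
    With knot spacing Delta = 2*sqrt(nu), the interpolation error is at most
    Delta**2/4 = nu, using about 2R/Delta = R/sqrt(nu) ReLUs per super-node."""
    k = int(np.ceil(R / np.sqrt(nu)))                   # number of linear pieces / ReLU units
    knots = np.linspace(-R, R, k + 1)
    slopes = np.diff(knots**2) / np.diff(knots)
    coeffs = np.diff(np.concatenate(([0.0], slopes)))   # slope changes -> ReLU coefficients
    def h(y):
        return knots[0] ** 2 + sum(c * relu(y - t) for c, t in zip(coeffs, knots[:-1]))
    return h, k

h, n_units = square_supernode()
y = np.linspace(-4.0, 4.0, 1001)
print(n_units, float(np.max(np.abs(h(y) - y**2))))   # 40 ReLU units, max error ~ nu = 0.01
```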
## References

[1] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. of Cont., Sig. and Sys., 2:303–314, 1989.

[2] K. I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neu. Net., 2(3):183–192, 1989.

[3] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neu. Net., 2(5):359–366, 1989.

[4] A. R. Barron. Approximation and estimation bounds for artificial neural networks. Mach. Lear., 14(1):115–133, 1994.

[5] J. L. Gauvain and C. H. Lee. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. on Speech and Aud. Proc., 2(2):291–298, 1994.

[6] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian mixture models. Dig. Sig. Proc., 10(1–3):19–41, 2000.

[7] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Trans. on Ima. Proc., 12(11):1338–1351, 2003.

[8] Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In Proc. of the 17th Int. Conf. on Pat. Rec., volume 2, pages 28–31. IEEE, 2004.

[9] N. Indurkhya and F. J. Damerau. Handbook of Natural Language Processing, volume 2. CRC Press, 2010.

[10] E. Abbe and C. Sandon. Provable limitations of deep learning. arXiv preprint arXiv:1812.06369, 2018.

[11] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conf. on Lear. Theory, pages 907–940, 2016.

[12] A. Daniely. Depth separation for neural networks. arXiv preprint arXiv:1702.08489, 2017.

[13] D. Rolnick and M. Tegmark. The power of deeper networks for expressing natural functions. In Int. Conf. on Lear. Rep., 2018.

[14] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proc. of ACM SIGKDD Int. Conf. on Know. Dis. and Data Min., pages 245–250. ACM, 2001.

[15] H. W. Lin, M. Tegmark, and D. Rolnick. Why does deep and cheap learning work so well? J. of Stat. Phy., 168(6):1223–1247, 2017.

[16] A. Shpilka and A. Yehudayoff. Arithmetic Circuits: A Survey of Recent Results and Open Questions. Now Publishers, 2010.

[17] O. Delalleau and Y. Bengio. Shallow vs. deep sum-product networks. In Adv. in Neu. Inf. Proc. Sys., pages 666–674, 2011.

[18] J. Martens and V. Medabalimi. On the expressive efficiency of sum product networks. arXiv preprint arXiv:1411.7717, 2014.

[19] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. arXiv preprint arXiv:1509.05009, 2015.

[20] M. Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.