# learning_very_simple_generative_models_is_hard__c4e648d9.pdf

Learning (Very) Simple Generative Models Is Hard

Sitan Chen UC Berkeley sitanc@berkeley.edu

Jerry Li Microsoft Research jerrl@microsoft.com

Yuanzhi Li CMU yuanzhil@andrew.cmu.edu

Motivated by the recent empirical successes of deep generative models, we study the computational complexity of the following unsupervised learning problem. For an unknown neural network F : Rd Rd , let D be the distribution over Rd

given by pushing the standard Gaussian N(0, Idd) through F. Given i.i.d. samples from D, the goal is to output any distribution close to D in statistical distance. We show under the statistical query (SQ) model that no polynomial-time algorithm can solve this problem even when the output coordinates of F are one-hidden-layer Re LU networks with log(d) neurons. Previously, the best lower bounds for this problem simply followed from lower bounds for supervised learning and required at least two hidden layers and poly(d) neurons [CGKM22, DV21]. The key ingredient in our proof is an ODE-based construction of a compactly supported, piecewise-linear function f with polynomially-bounded slopes such that the pushforward of N(0, 1) under f matches all low-degree moments of N(0, 1).

1 Introduction

In recent years, deep generative models such as variational autoencoders, generative adversarial networks, and normalizing flows [GPAM+14, KW13, RM15] have seen incredible success in modeling real world data. These work by learning a parametric transformation (e.g. a neural network) of a simple distribution, usually a standard normal random variable, into a complex and high-dimensional one. The learned distributions have been shown to be shockingly effective at modeling real world data. The success of these generative models begs the following question: when is it possible to learn such a distribution? Not only is this a very natural question from a learning-theoretic perspective, but understanding this may also lead to more direct methods to learn generative models for real data.

More formally, we consider the following problem. Let D be the unknown pushforward distribution over Rd given by f(g), where g N(0, Id) is a standard normal Gaussian, and f is an unknown feedforward neural network with non-linear (typically Re LU) activations. Such distributions naturally arise as the output of many common deep generative models in practice. The learner is given n samples from D, and their goal is to output the description of some distribution which is close to D.

When f is a one-layer network (i.e. of the form f(g) = Re LU(Wg)), there are efficient algorithms for learning the distribution [WDS19, LLDD20]. However, this setting is unsatisfactory in many ways, as one-layer networks lack much of the complex structure that makes these generative models so appealing in practice. Indeed, when the neural network only has a single layer, the resulting distribution is similar to a truncated Gaussian, and one can leverage techniques developed for learning from truncated samples. Notably, this structure disappears even with two-layer neural networks. Even in the two-layer case, despite significant interest, very little is known about how to learn D efficiently.

In fact, a recent line of work suggests that learning neural network pushforwards of Gaussians may be an inherently difficult computational task. Recent results of [DV21, CGKM22] show hardness of supervised learning from labeled Gaussian examples under cryptographic asssumptions, and the latter also demonstrates hardness for all statistical query (SQ) algorithms (see Section 1.3 for a more

36th Conference on Neural Information Processing Systems (Neur IPS 2022).

detailed description of related work). These naturally imply hardness in the unsupervised setting (see supplement). However, these lower bound constructions still have their downsides. For one, all of these constructions require at least three layers (i.e. two hidden layers), and so leave open the possibility that efficient learning is possible when the neural network only has one hidden layer. Additionally, the resulting neural networks in these constructions are quite complicated. In particular, the size of the neural networks in these hard instances, that is, the number of hidden nonlinear activations for any output coordinate, must be polynomially large. This begs the natural question:

Can we learn pushforwards of Gaussians under one-hidden-layer neural networks of small size?

1.1 Our Results

We demonstrate strong evidence that despite the simplicity of the setting, this learning task is already computationally intractable. We show there is no polynomial-time statistical query (SQ) algorithm which can learn the distribution of f(g), when g N(0, Id) and each output coordinate of f is a one-hidden-layer neural network of logarithmic hidden size. We formally define the SQ model in Definition 2; we note that it is well-known to capture almost all popular learning algorithms [FGR+17].

Theorem 1.1 (informal, see Theorem 3.1). For any d > 0, and any C 1, there exists a family of one-hidden-layer neural networks F from Rd to Rd C so that the following properties hold. For any f F, let Df denote the distribution of f(g), for g N(0, Id). Then, we have that:

For all f F, d TV(N(0, Id), Df) = Ω(1),1

Every output coordinate of f is a sum of O(log d/ log log d) Re LUs, with poly(d)-bounded weights.

Any SQ algorithm which can distinguish between Df and N(0, Id) with high probability for all f F requires dω(1) time and/or samples.

In other words, there is a family of one-hidden-layer Re LU networks of logarithmic size whose corresponding pushforwards are statistically very far from Gaussian, yet no efficient SQ algorithm can distinguish them from a Gaussian. Note this implies hardness even of improperly learning such pushforwards: not only is it hard to recover the parameters of the network or output a network close to the underlying distribution Df, but it is hard to learn any distribution close to Df. In contrast, if one ignores issues of computational efficiency, one can easily learn networks in this family with polynomial sample complexity via hypothesis selection, as there are only poly(d) many parameters specifying any network in this family.

Since such networks are arguably some of the simplest neural networks with more than one layer, this suggests that learning even the most basic deep generative models may already be a very difficult task, at least without additional assumptions. Still, this is by no means the last word in this direction. Given the real world success of deep generative models, a natural and important direction is to identify natural conditions under which we can efficiently learn. We view our results as a first step towards understanding the computational landscape of this important learning problem, and our result provides evidence that (strong) assumptions need to be made on f for the pushforward f(g) to be efficiently learnable, even in very simple two-layer cases.

1.2 Our Techniques

Like many recent SQ lower bounds, ours follows the general framework which was introduced in [DKS17] and builds on [FGR+17]. Here one considers the following non-Gaussian component analysis task. We will consider a family of distributions, parametrized by a hidden direction v Sd 1, which are Gaussian in every direction orthogonal to the hidden direction but which are very non-Gaussian along the hidden direction.

To formalize this, let D be a distribution D over R which is known to the learner and far in statistical distance from Gaussian. Given a unit vector v in d dimensions, let P D v denote the distribution over Rd whose projection along v is given by D and whose projection in all directions orthogonal to v is standard Gaussian. Given samples from some unknown distribution over Rd, the goal is to

1d TV denotes total variation distance. A lower bound for Wasserstein distance also holds, see supplement.

decide whether the unknown distribution is N(0, Idd) or P D v for some v. [DKS17] showed that if D s moments match those of N(0, 1) up to some degree m, then under mild conditions, any SQ algorithm for this task requires at least dΩ(m) queries (Lemma 3.3).

Suppose one could exhibit a one-hidden-layer Re LU network f : Rℓ R such that the pushforward D = f(N(0, Id)) satisfied such properties. Then we can realize P D v as a pushforward as follows. Let U be a rotation mapping the first standard basis vector in Rd to v. Then consider the function F : Rℓ+d 1 Rd mapping z to U (f(z1, . . . , zℓ), zℓ+1, . . . , zℓ+d 1). One can check that every output coordinate of F is computed by a one-hidden-layer Re LU network with size essentially equal to that of f. By [DKS17], we would immediately get the desired SQ lower bound.

The main challenge is thus to construct such a network whose pushforward matches the low-degree moments of N(0, 1). It is not hard to ensure the existence of such f with essentially infinite weights (Corollary 4.2 and Lemma 4.4). It is much less clear whether this is possible with polynomially bounded weights, and this is our primary technical contribution. We design and analyze a certain ODE which defines a one-parameter family of perturbations to f, such that the low-degree moments of the corresponding pushforwards remain unchanged over time. By evolving along this family over an inverse-polynomial time scale, we obtain a network with polynomially bounded weights whose pushforward matches the low-degree moments of N(0, 1). We defer the details to Section 4.

1.3 Related Work

A full literature review on the theory of learning deep generative models is beyond the scope of this paper (see e.g. the survey of [GSW+21]). For conciseness we cover only the most relevant papers.

Upper bounds. In terms of upper bounds, much of the literature has focused on a different setting, where the goal is to understand when first order dynamics can learn toy generative models [FFGT17, DISZ17, GHP+19, LLDD20, AZL21, JMGL22], which are much simpler than the ones we consider here. For learning pushforwards of Gaussians under neural networks with Re LU activations, algorithms with provable guarantees are only known when the network has no hidden layers [WDS19, LLDD20]. This is in contrast to the supervised setting, where fixed parameter tractable algorithms are known for learning Re LU networks of arbitrary depth [CKM22].

A different line of work seeks to find efficient learning algorithms when the activations are given by low degree polynomials [FFGT17, LD20, CLLZ22]. Arguably the closest to our work is [CLLZ22], which gives polynomial-time algorithms for learning low-degree polynomial transformations of Gaussians, in a smoothed setting. It is a very interesting open question if similar smoothed assumptions can be leveraged to circumvent our lower bound when the activations are Re LU. Unfortunately, these papers heavily leverage the nice moment structure of low-degree Gaussian polynomials, and it is unclear how their techniques can generalize to different activations.

Lower bounds. Much of the literature on lower bounds for learning neural networks has focused on the supervised setting, where a learner is given labeled examples (x, f(x)), and the goal is to output a good predictor. There are many lower bounds known in the distribution-free setting [BR92, Vu98, KS09, LSSS14, DV20], however, these do not transfer over to our (unsupervised) setting. When x is Gaussian, the aforementioned work of [CGKM22] derives hardness for learning two-hidden-layer networks with polynomial size for all SQ algorithms, as well as under cryptographic assumptions (see also [DV21]). It is not hard to show (see supplement) that this lower bound immediately implies a lower bound for the unsupervised problem. In the supervised setting, lower bounds are also known against restricted families of SQ [VW19, GGJ+20, DKKZ20, SVWX17], when there are adversarially noisy labels [KK14, DKZ20, GGK20, SZB21], and in discrete settings [Val84, Kha95, AK95, Fel09, CGV15, DGKP20, AAK21], but to our knowledge, these results do not transfer to our setting.

The literature on lower bounds for the unsupervised problem we consider here is much sparser. Besides [DV21, CGKM22], we also mention the recent work of [CLLM22] that studies whether achieving small Wasserstein GAN loss implies distribution learning. A corollary of their results is cryptographic hardness for learning pushforwards of Gaussians under networks with constant depth and polynomial size, but only when the learner is given by a Lipschitz Re LU network discriminator. However, this does not rule out efficient algorithms which do not output such Lipschitz discriminators.

Finally, we remark that the family of hard distributions we construct can be thought of as a close cousin of the parallel pancakes construction of [DKS17]. This and slight modifications thereof are mixtures of Gaussians which are known to be computationally hard both in the SQ model [DKS17, BLPR19] and under cryptographic assumptions [BRST21, GVV22].

SQ lower bounds via ODEs. We remark that in a very different context, [DKZ20] also used an ODE to design a moment-matching construction. While our approach draws inspiration from theirs, an important difference is that they use their ODE as a size reduction trick to construct a step function f : R { 1} with a small number of linear pieces such that Eg N(0,1)[f(g)gk] = 0 for all small k, while we use our ODE as a weight reduction trick to construct a continuous neural network f : R R with bounded weights such that Eg N(0,1)[f(g)k] = Eg N(0,1[gk]. The form of the moments Eg N(0,1)[f(g)gk] they consider is simpler than in our setting, and while they essentially run their ODE to singularity and use non-quantitative facts like the invertibility of a certain Jacobian, we only run our ODE for a finite horizon and need to carefully control the condition number of the Jacobian arising in our setting over this horizon (e.g. Lemma 4.8).

2 Technical Preliminaries

We freely abuse notation and use the same symbols to denote probability distributions, their laws, and their density functions. Given a distribution A over a domain Ωand a function f : Ω Ω , we let f(A) denote the pushforward of A through f, that is, the distribution A of the random variable f(z) for z A. Given distributions p, q for which p is absolutely continuous with respect to q, let χ2(p, q) denote their chi-squared divergence. Let p q denote the convolution of p and q. Also, we use p to denote ℓp norm, omitting the subscript when p = 2. σmin( ) denotes minimum singular value. Given an invertible matrix M, we use M to denote the matrix (M ) 1 = (M 1) .

Henceforth Eg[ ] will always denote Eg N(0,Id)[ ], where the dimensionality is implicit from context. Let γσ2(x) 1 σ

2πe x2/(2σ2). Given S R, we use γσ2(S) to denote R γ(x) 1[x S] dx.

When σ = 1, we omit the subscript σ2. We will also use γ(d)(x) to denote the density of N(0, Idd).

Definition 1 (One-hidden-layer Re LU networks). We say that g : Rd R is a one-hidden-layer Re LU network with size S and W-bounded weights if there exist w1, . . . , w S Rd, b1, . . . , b S R, and s1, . . . , s S { 1} for which g(x) = PS i=1 si Re LU( wi, x + bi) and wi , |bi| W for all i S.2 Here Re LU(z) max(0, z). Given f : Rd Rd , together with a distribution A over Rd, we say that f(A) is a one-hidden-layer Re LU network pushforward of A with size S and W-bounded weights if each coordinate of f is of this form.

Fact 2.1 ([GMSR20]). If V Rn n is a Vandermonde matrix with nodes z1, . . . , zn, that is, Vi,j = zi 1 j , and {zi} are ζ-separated, then σmin(V ) 1

Theorem 2.2 (Peano s existence theorem, see e.g. Theorem 2.1 from [Har02]). For T, r > 0 and y0 Rn, let B R Rn be the parallelepiped consisting of (t, y) for which 0 t T and y y0 r. If f : B R is continuous and satisfies |f(t, y)| M for all (t, y) B, then the initial value problem {y (t) = f(t, y) and y(0) = y0} has a solution over t [0, min(T, r/M)].

Definition 2 (Statistical queries). Given distribution D over Rd and parameters τ, t > 0, a STAT(τ) oracle takes in any query of the form f : Rd [ 1, 1] and outputs a value from [Ex D[f(x)] τ, Ex D[f(x)] + τ], while a VSTAT(t) oracle takes in any query of the form f : Rd [0, 1] and outputs a value from [Ex D[f(x)] τ, Ex D[f(x)] + τ] for τ = max(1/t, p

Vx D[f(x)]/t).

3 Statistical Query Lower Bound

Our main result is the following lower bound in this model for learning Re LU network pushforwards:

Theorem 3.1. Let d N be sufficiently large. Any SQ algorithm which, given SQ access to an arbitrary one-hidden-layer Re LU network pushforward of N(0, Idd) of size O(log d/ log log d) with poly(d)-bounded weights, outputs a distribution which is O(1)-close in d TV( ) must make at least dΩ(log d/ log log d) queries to either STAT(τ) or VSTAT(1/τ 2) for τ = d Ω(log d/ log log d).

2In practice, one can think of S as the number of channels, e.g. in a deconvolution network.

Our proof will invoke the following key technical result whose proof we defer to Section 4. It exhibits a one-hidden-layer Re LU network f : R2 R with bounded weights under which the pushforward of N(0, 1) matches the low-degree moments of N(0, 1) to arbitrary precision, in addition to some other technical conditions that we need to formally establish our statistical query lower bound: Theorem 3.2. Fix any odd m and ν, σ < 1. There is a one-hidden-layer Re LU network f : R2 R of size O(m) with weights at most m O(m) for which the pushforward D f (N(0, Id)) satisfies

1. | Ex D[xk] Eg N(0,1)[gk]| < ν for all k = 1, . . . , m

2. χ2(D, N(0, 1)) exp(O(m))/σ

3. d TV(P D v , P D v ) 1 2σ log(1/σ) m Ω(m) for any v, v Sd 1 satisfying | v, v | 1/2.

The rest of the proof of our lower bound will then follow the framework introduced in [DKS17] and subsequently generalized in [DK20]. First, given a distribution D over R and v Sd 1, let P D v denote the distribution over Rd with density P D v (x) = D( v, x ) γ(d 1)(x v, x v), that is the distribution which is given by D in the direction v and is given by N(0, Id vv ) orthogonal to v. We use the following generic statistical query lower bound about learning such distributions, a proof of which we include in the supplement for completeness. Lemma 3.3. Let m N and 0 < C < 1/2. Let D be a distribution such that 1) χ2(D, N(0, 1)) is finite, and 2) | Ex D[xk] Eg N(0,1)[gk]| Ω(d) (m+1)(1/4 C/2)p

χ2(D, N(0, 1)) for all k = 1, . . . , m.

Consider the set of distributions {P D v }v Sd 1 for d mΩ(1/C). If there is some ϵ > 0 for which d TV(P D v , P D v ) > 2ϵ whenever | v, v | 1/2, then any SQ algorithm which, given SQ access to P D v for an unknown v Sd 1, outputs a hypothesis Q with d TV(Q, P D v ) ϵ needs at least dm+1

queries to STAT(τ) or to VSTAT(1/τ 2) for τ O(d) (m+1)(1/4 C/2) p

χ2(D, N(0, 1)).

Proof of Theorem 3.1. By Theorem 3.2 applied with sufficiently large odd m and sufficiently small σ, there exists a distribution D = f (N(0, Id2)) over R for f : R2 R of size O(m) with m O(m)- bounded weights satisfying the hypotheses of Lemma 3.3 for ϵ = 0.49, and χ2(D, N(0, 1)) exp(O(m)) (note that we can absorb the 1/σ factor in Theorem 3.2 into exp(O(m)) because we can take σ = exp( Θ(m))). As long as m d O(C), we conclude that an SQ algorithm for learning any distribution from {P D v }v Sd 1 to total variation distance 1/4 must make at least dm+1

queries to STAT(τ) or VSTAT(1/τ 2) for τ O(d) (m+1)(1/4 C/2) exp(O(m)). By taking m = Θ(log d/ log log d), we ensure that m O(m) poly(d). We re done by taking C in Lemma 3.3 to be C = 1/4.

The proof of the theorem is complete upon noting that any distribution P D v can be implemented as a pushforward of N(0, Idd+1) under a one-hidden-layer Re LU network Fv : Rd+1 Rd of size O(log d/ log log d) with poly(d)-bounded weights. Let U O(d) be a rotation mapping the first standard basis vector in Rd to v. Then for Fv(z1, . . . , zd+1) U(f (z1, z2), z3, . . . , zd+1) we have that Fv(N(0, Idd+1)) = P D v as desired. Furthermore, note that every output coordinate of Fv(z1, . . . , zd+1) is a one-hidden-layer Re LU network of the form αf (z1, z2)+ u, (z3, . . . , zd+1) for some vector (α, u) Rd. Note that the size of this network is two plus that of f , and its weights are also upper bounded by poly(d), so Fv s output coordinates are of size O(log d/ log log d). Remark 3.4. Theorem 1.1 was stated with output dimension polynomially bigger than input dimension, whereas in our construction, the output dimension (d) is less than the input dimension (d + 1). One can get the former by a padding argument (i.e. by duplicating output coordinates) to give a generator with arbitrarily large polynomial stretch and such that the dlog d/ log log d lower bound still applies.

4 Moment-Matching Construction

In this section we prove Theorem 3.2, the main technical ingredient in the proof of Theorem 3.1.

4.1 Moment-Matching With Unbounded Weights

We begin by making the simple observation that for one-hidden-layer networks with unbounded weights, it is easy to construct networks such that the pushforward of N(0, 1) under these networks

matches the moments of N(0, 1). The starting point for this is the following standard construction. Roughly speaking, it gives a set of well-separated points on the real line and a distribution over them which matches low-order moments with N(0, 1). Lemma 4.1 (Lemma 4.3 from [DKS17]). For any m N, there exist weights λ1, . . . , λm 0 and points h1, . . . , hm R for which

1. (Moments match) Pm i=1 λihk i = Eg[gk] for all k = 0, . . . , 2m 1 (note that the special case of k = 0 implies Pm i=1 λi = 0).

2. (Points symmetric about origin) h1 hm and hi = hm i+1 for all 1 i m.

3. (Weights symmetric) λ1 λ m/2 and λi = λm i+1.

4. (Points bounded and separated) Ω(1/ m) |hi| O( m) for all 1 i m and {hi} are Ω(1/ m)-separated.

5. (Weights not too small) mini λi e cm for an absolute constant c > 0.

6. (Central point and weight) If m is odd, then h(m+1)/2 = 0 and λ(m+1)/2 = Θ(1/ m).

This immediately implies that there exists a discontinuous piecewise linear function f : R R for which the pushforward f(N(0, 1)) matches the low-degree moments of N(0, 1). The reason is that we can take f to be a step function which takes on values given by the positions of the points h1, . . . , hm in Lemma 4.1. We can then take the steps to have lengths such that the probability that a standard Gaussian input to f falls under the step given by some hi is precisely λi (see Figure 1). Corollary 4.2. For any m N, there is a partition of R into disjoint intervals I1, . . . , Im, along with a choice of scalars h1, . . . , hm, such that the step function f : R R given by f(z) = Pm i=1 hi 1[z Ii] satisfies Ex f(N(0,1))[xk] = Eg N(0,1)[gk] for all k = 0, . . . , 2m 1.

Proof. Let {λi}, {hi} be as in Lemma 4.1. As P

i λi = 1, there are intervals R = I1 Im for which γ(Ii) = λi. As Ex f(N(0,1))[xk] = P

i λihk i , the claim follows by Part 1 of Lemma 4.1.

By infinitesimally perturbing f in Corollary 4.2, we can ensure f(N(0, 1)) still approximately matches the low-degree moments of N(0, 1) to arbitrary precision and that the linear pieces of f have finite slopes, though some slopes will be arbitrarily large. The new function can thus be represented as a one-hidden-layer Re LU network, but unfortunately its weights will be arbitrarily large. The key challenge in the sequel is to design a better perturbation so the resulting f has polynomially bounded slopes yet is such that f(N(0, 1)) matches the low-degree moments of N(0, 1).

4.2 Bump Construction

Before we describe our perturbation scheme, we slightly modify the construction in Corollary 4.2. In place of a step function, we will consider a certain sum of bump functions, illustrated in Figure 2. Definition 3 (Bump functions). Given w, ϵ > 0 and h, c R, define T w,h,ϵ c : R R by

T w,h,ϵ c (z) =

ϵ (z c + ϵ + w) if z [c ϵ w, c w] h if z [c w, c + w] h

ϵ (z c ϵ w) if z [c + w, c + ϵ + w] 0 otherwise.

Because these are piecewise linear functions, we can implement them as one-hidden-layer Re LU networks: Fact 4.3. Given w, ϵ > 0 and h, c R, T w,h,ϵ c can be implemented as a one-hidden-layer Re LU network with size 4 and W-bounded weights for W h

ϵ max(1, |c| + ϵ + w). In the next lemma, we show that we can replace the construction in Corollary 4.2, given by a step function, with a sum of well-separated bumps with infinitesimally small ϵ parameter. This will be the instance on which we will apply our perturbation scheme. Lemma 4.4. For any odd m and ν < 1, there are centers c1 cm 1, widths w1, . . . , wm 1 > 0, heights h1 hm 1 R, and parameter ϵ > 0 for which the following holds. Define the function f : R R (see Figure 1) by f(z) Pm 1 i=1 T wi,hi,ϵ ci (z) for any 0 ϵ < ϵ. Then f satisfies

Figure 1: Left: construction from Lemma 4.1 gives rise to step function in Corollary 4.2. Right: removing central spike gives rise to sum of bumps in Lemma 4.4

1. (Bumps are well-separated) For all 1 i < m 1, ci + m 3/2 ci+1.

2. (Moments match) | Ex f(N(0,1))[xk] Eg[gk]| < ν for all k = 1, . . . , 2m 1.

3. (Symmetricity) wi = wm i, hi = hm i, and ci = cm i for all 1 i < m.

4. (Bounded, separated heights) Ω(1/ m) |hi| O( m) for all 1 i < m, and {hi} are Ω(1/ m)-separated.

5. (Intervals not too thin) mini γ([ci wi, ci + wi]) e cm for an absolute constant c > 0.

6. (Bounded endpoints) |ci| + wi O(log m) for all 1 i < m.

Proof sketch, see supplement. The idea is that when m is odd, one of the points hi from Lemma 4.1 is zero and does not contribute to the moments. We want to design f for which f(N(0, 1)) is the mixture of point masses given by removing this point and its corresponding weight. We take f to have one linear piece for every one of the remaining point masses. By Part 6 of Lemma 4.1, these points have total weight 1 Θ(1/ m), so by anticoncentration we can arrange these linear pieces to be m 3/2-separated, satisfying Part 1. The remaining parts follow easily from Lemma 4.1.

Unfortunately the slopes in the function constructed in Lemma 4.4 are arbitrarily large if ν is arbitrarily small. The issue still remains of how to get a continuous piecewise-linear function whose slopes are polynomially bounded so that the corresponding Re LU network has polynomially bounded weights. As we illustrate next however, the Ω(m 3/2) spacing between the bumps in the definition of f in Lemma 4.4 gives us sufficient room to carefully perturb f to achieve this goal.

4.2.1 Some Estimates for Bump Moments

For convenience, define M w,h,ϵ c,k Eg T w,h,ϵ c (g)k . We conclude this subsection by collecting some

useful bounds for this quantity. As we verify in the supplement, M w,h,ϵ c,k is continuously differentiable with respect to ϵ when ϵ > 0. Additionally, we have the following (see supplement):

M w,h,ϵ x,k h = k

h M w,h,ϵ x,k .

Lemma 4.6. For any w 0, c, h, h R, ϵ ϵ 0, and even k N, we have the bound

c,k M w,h,ϵ c,k | hk |(h /h)k 1| + ϵ ϵ . In particular, this implies that M w,h,ϵ c,k ϵ hk.

4.3 ODE-Driven Perturbation

Denote the parameters of the function constructed in Lemma 4.4 by {(hi(0), wi, ci)}1 i<m. We also define ϵ(0) to be some arbitrarily small positive quantity satisfying ϵ(0) ϵ for ϵ from Lemma 4.4.

We will design an ODE whose solution specifies a one-parameter family of functions

i=1 T wi,hi(t),ϵ(t) ci (1)

that arise from gradually perturbing the function from Lemma 4.4. Roughly speaking, starting at hi(0) and ϵ(0) for all 1 i < m, perturbing the function along this one-parameter family will correspond to keeping the widths wi and centers ci of the bumps fixed, increasing the ϵ parameter of every bump at unit speed, and evolving the heights hi(t) in such a way that the moments of the pushforward of N(0, 1) under ft remain constant in t for all 0 t T. We illustrate this evolution

in Figure 2. Here T is some horizon which is at least inverse-polynomially large but smaller than m 3/2 so that the edges ci (ϵ(t) + wi) of the bumps don t collide with each other (this is where we make crucial use of Part 1 of Lemma 4.4). At the end of this horizon, we want to show that the heights will not have changed too much, whereas the bumps now have ϵ parameter given by inverse-polynomially large T. This will imply that f T has polynomially bounded slopes as desired.

Figure 2: Evolution of one of the m 1 bumps constituting ft

As the odd moments of f0(N(0, 1)) vanish and the parameters {(hi(0), wi, ci)}1 i<m satisfy the symmetry properties from Part 3 of Lemma 4.4, it is easy to ensure that the odd moments of ft(N(0, 1)) also vanish: simply take hi(t) = hm i(t) for all 1 i m 1.

We thus focus on evolving (h1(t), . . . , h(m 1)/2(t)). For convenience denote this by h(t). Define the moment vector µ : R(m 1)/2 R Rm 1 by

i=1 M wi,hi,ϵ ci,2ℓ

1 ℓ (m 1)/2

For any fixed (h, ϵ) in a small neighborhood of (h(0), ϵ(0)), we want to show there is a direction v R(m 1)/2 such that the directional derivative of µ in the direction w (v1, . . . , v(m 1)/2, 1) is zero. The constraint that the directional derivative wµ(h, ϵ) vanishes specifies a linear system in v:

i=1 vi M wi,hi,ϵ ci,2ℓ hi =

M wi,hi,ϵ ci,2ℓ ϵ 1 ℓ (m 1)/2. (2)

Recalling Lemma 4.5, we can rewrite this as

hi M wi,hi,ϵ ci,2k =

M wi,hi,ϵ ci,2ℓ ϵ 1 ℓ (m 1)/2.

To express this more compactly, define b(ϵ) R(m 1)/2 and Z(h, ϵ) by

M wi,hi,ϵ ci,2ℓ ϵ and Z(h, ϵ)i,ℓ M wi,hi,ϵi ci,2ℓ . (3)

Also define the matrices A(h) diag(1/h1, . . . , 1/h(m 1)/2) and B diag(2, 4, . . . , m 1). Then (2) is equivalent to v A(h)Z(h, ϵ)B = b(ϵ) . Provided A(h)Z(h, ϵ)B is invertible, the natural choice for v would thus be v = B 1Z(h, ϵ) A(h) 1 b(ϵ). Therefore, defining

w(t, h) B 1Z(h, ϵ(0) + t) A(h) 1 b(ϵ(0) + t), 1 , (4)

we consider the following initial value problem

h (t) = w(t, h(t)) and h(0) = (h1(0), . . . , h(m 1)/2(0)). (5)

Note that if we had a solution h(t) to (5) for t [0, T] for some horizon T, then we would have

tµ(h(t), t) = (h(t), t)µ(h(t), t) h (t) = w(t)µ(h(t), t) = 0, (6)

implying that the low-degree moments of ft defined in (1) are constant in t as desired.

4.4 Existence and Boundedness of h(t)

To carry out the strategy in Section 4.3, we must establish that (1) a solution to the initial value problem (5) exists over a non-negligible horizon T, and (2) the entries of h(t) do not explode in t.

For both of these, we need to show that the matrix Z(h, ϵ) is invertible or, more specifically, wellconditioned for (h, ϵ) in a neighborhood of (h(0), ϵ(0)). We first establish this at time t = 0 by relating Z(h(0), ϵ(0)) to a certain Vandermonde matrix and appealing to Fact 2.1, see supplement: Lemma 4.7. σmin(Z(h(0), ϵ(0))) m Cm for an absolute constant C > 0.

Thus, for (h, ϵ) in a neighborhood of (h(0), ϵ(0)), Z(h, ϵ) is well-conditioned (see supplement).

Lemma 4.8. Let C > 0 be from Lemma 4.7. For any (h, ϵ) satisfying h h(0) m C m and 0 ϵ ϵ(0) m C m for sufficiently large absolute constant C > 0, σmin(Z(h, ϵ)) m Cm/2.

To establish Property (1), we must first verify that w(t, h) is continuous (see supplement for proof): Lemma 4.9. Let C > 0 be the absolute constant from Lemma 4.8. Then the function w(t, h) defined in (4) is continuous with respect to both t and h for t m C m and h h(0) m C m.

Lastly, we must show that under the hypotheses of Lemma 4.9, w(t, h) is not too large. This implies Property (1) by Theorem 2.2 and Property (2) because h (t) = w(t, h(t)) :

Lemma 4.10. Let C > 0 be the absolute constant from Lemma 4.8. If t m C m and h h(0) m C m, then w(t, h) m C m for some absolute constant C > 0.

Proof. By the second part of Lemma 4.6, every entry of b(ϵ(0)+t) is at most m 1

2 (maxi hi)m 1 m 1

2 (O( m)+m C m)m 1 m O(m), where in the penultimate step we used Part 4 of Lemma 4.4 and our hypothesis on h. Note that σmin(Z(h, ϵ(0) + t)) m Cm/2 by Lemma 4.8, σmin(B) 2, and σmin(A(h)) mini 1/hi Ω( m) by Part 4 of Lemma 4.4 and our hypothesis on h. We conclude that B 1Z(h, ϵ(0) + t) A(h) 1 b(ϵ(0) + t) has L norm at most m C m for some absolute constant C > 0, so w(t, h) m C m as claimed.

We are now ready to put these ingredients together to prove the key lemma for showing Theorem 3.2: Lemma 4.11. Fix any odd m N and any 0 < ν < 1. There is a one-hidden-layer Re LU network f : R R of size O(m) with weights at most m O(m) for which the pushforward D = f(N(0, 1)) satisfies | Ex D[xk] Eg N(0,1)[gk]| < ν for all k = 1, . . . , m.

Proof. Define the parallelepiped B of pairs (t, h) for which 0 t m C m and h h(0) m C m. By Lemma 4.9, w is continuous over B. By Lemma 4.10, w(t, h) m C m.

By Theorem 2.2, the initial value problem in (5) has a solution h(t) over t [0, T] for T = m (C +C )m. Furthermore, because h (t) = w(t, h(t)) m C m, we conclude that 1 T h(T) m C m. The slopes of the bumps T wi,hi(T ),ϵ(0)+T ci are therefore bounded by m C m.

For (m 1)/2 < i m 1, define hi(T) = hm i(T) and consider the one-parameter family of functions ft Pm 1 i=1 T wi,hi(t),ϵ(0)+t ci . Because ci = cm i and wi = wm i for all 1 i < m by Part 3 of Lemma 4.4, we conclude that the odd moments of f T (N(0, 1)) all vanish. As for the even moments, because

tµ(h(t), t) = 0 by (6), we conclude that the even moments of f T (N(0, 1)) up to degree m 1 agree with those of f0(N(0, 1)). So because D0 = f0(N(0, 1)) satisfies | Ex D[xk] Eg N(0,1)[gk]| < ν for 1 k m by Part 2 of Lemma 4.4, the same holds for f T .

As the endpoints of the intervals supporting the bumps are bounded by O(log m), Fact 4.3 implies f T is a one-hidden-layer network with size O(m) and m O(m)-bounded weights.

4.5 Proof of Theorem 3.2

Having designed a pushforward given by a one-hidden-layer network with m O(m)-bounded weights satisfying the first part of Theorem 3.2, we now modify this to also satisfy the remaining two parts.

To do this, we convolve a suitable scaling of the pushforward by a thin Gaussian. The following lemmas show that the moments continue to match, the chi-squared divergence between the result and N(0, 1) is bounded, and the third part of Theorem 3.2 is satisfied (see supplement for proof):

Lemma 4.12. Let D be any symmetric distribution for which | Ex D[xk] Eg[gk]| < ν for all 1 k < m. For any c R let c D denote the distribution obtained by rescaling D by a factor of c. Then D

1 σ2 D N(0, σ2) satisfies | Ex D [xk] Eg[gk]| < ν for all 1 k < m.

Lemma 4.13. For any density A on [ R, R] and σ 1/2, χ2(A N(0, σ2), N(0, 1)) e O(R2)/σ.

Lemma 4.14. Let D = f(N(0, 1)) be from Lemma 4.11, and define D

1 σ2 D N(0, σ2). Then for any v, v Sd 1 satisfying | v, v | 1/2, d TV(P D v , P D v ) 1 2σ log(1/σ) m Ω(m).

Proof of Theorem 3.2. Let f be the function constructed in Lemma 4.11 and define f : R2 R by f (z1, z2) =

1 σ2f(z1) + σz2. Note that f (N(0, Id)) is exactly

1 σ2f(N(0, 1)) N(0, σ2), so the three parts of Theorem 3.2 follow immediately from Lemmas 4.12, 4.13, and 4.14 respectively.

Conclusion. In this work we established that learning one-hidden-layer Re LU network pushforwards of logarithmic size is hard for efficient SQ algorithms. Given that SQ algorithms capture a large family of efficient algorithms, this suggests that this basic learning problem, despite its simple description, is computationally intractable. Of course, our lower bound construction is worst case, and so it is unlikely to actually arise in practice. A natural open question, as mentioned previously, is to come up with natural assumptions (for instance, smoothed analysis settings like in [CLLZ22]) that allow us to circumvent this lower bound, and to achieve efficient learning algorithms.

[AAK21] Naman Agarwal, Pranjal Awasthi, and Satyen Kale. A deep conditioning treatment of neural networks. In Algorithmic Learning Theory, pages 249 305. PMLR, 2021.

[AK95] Dana Angluin and Michael Kharitonov. When won t membership queries help? Journal of Computer and System Sciences, 50(2):336 355, 1995.

[AZL21] Zeyuan Allen-Zhu and Yuanzhi Li. Forward super-resolution: How can gans learn hierarchical generative models for real-world distributions. ar Xiv preprint ar Xiv:2106.02619, 2021.

[BLPR19] Sébastien Bubeck, Yin Tat Lee, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. In International Conference on Machine Learning, pages 831 840. PMLR, 2019.

[BR92] Avrim L Blum and Ronald L Rivest. Training a 3-node neural network is np-complete. Neural Networks, 5(1):117 127, 1992.

[BRST21] Joan Bruna, Oded Regev, Min Jae Song, and Yi Tang. Continuous lwe. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 694 707, 2021.

[CGKM22] Sitan Chen, Aravind Gollakota, Adam R Klivans, and Raghu Meka. Hardness of noisefree learning for two-hidden-layer neural networks. ar Xiv preprint ar Xiv:2202.05258, 2022.

[CGV15] Aloni Cohen, Shafi Goldwasser, and Vinod Vaikuntanathan. Aggregate pseudorandom functions and connections to learning. In Theory of Cryptography Conference, pages 61 89. Springer, 2015.

[CKM22] Sitan Chen, Adam R Klivans, and Raghu Meka. Learning deep relu networks is fixed-parameter tractable. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 696 707. IEEE, 2022.

[CLLM22] Sitan Chen, Jerry Li, Yuanzhi Li, and Raghu Meka. Minimax optimality (probably) doesn t imply distribution learning for gans. ar Xiv preprint ar Xiv:2201.07206, 2022.

[CLLZ22] Sitan Chen, Jerry Li, Yuanzhi Li, and Anru R Zhang. Learning polynomial transformations. ar Xiv preprint ar Xiv:2204.04209, 2022.

[DGKP20] Abhimanyu Das, Sreenivas Gollapudi, Ravi Kumar, and Rina Panigrahy. On the learnability of random deep networks. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 398 410. SIAM, 2020.

[DISZ17] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism. ar Xiv preprint ar Xiv:1711.00141, 2017.

[DK20] Ilias Diakonikolas and Daniel M Kane. Hardness of learning halfspaces with massart noise. ar Xiv preprint ar Xiv:2012.09720, 2020.

[DKKZ20] Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, and Nikos Zarifis. Algorithms and sq lower bounds for pac learning one-hidden-layer relu networks. In Conference on Learning Theory, pages 1514 1539. PMLR, 2020.

[DKS17] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73 84. IEEE, 2017.

[DKZ20] Ilias Diakonikolas, Daniel Kane, and Nikos Zarifis. Near-optimal sq lower bounds for agnostically learning halfspaces and relus under gaussian marginals. Advances in Neural Information Processing Systems, 33:13586 13596, 2020.

[DV20] Amit Daniely and Gal Vardi. Hardness of learning neural networks with natural weights. Advances in Neural Information Processing Systems, 33:930 940, 2020.

[DV21] Amit Daniely and Gal Vardi. From local pseudorandom generators to hardness of learning. In Conference on Learning Theory, pages 1358 1394. PMLR, 2021.

[Fel09] Vitaly Feldman. On the power of membership queries in agnostic learning. The Journal of Machine Learning Research, 10:163 182, 2009.

[FFGT17] Soheil Feizi, Farzan Farnia, Tony Ginart, and David Tse. Understanding gans: the lqg setting. ar Xiv preprint ar Xiv:1710.10793, 2017.

[FGR+17] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh S Vempala, and Ying Xiao. Statistical algorithms and a lower bound for detecting planted cliques. Journal of the ACM (JACM), 64(2):1 37, 2017.

[GGJ+20] Surbhi Goel, Aravind Gollakota, Zhihan Jin, Sushrut Karmalkar, and Adam Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. In International Conference on Machine Learning, pages 3587 3596. PMLR, 2020.

[GGK20] Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33:2147 2158, 2020.

[GHP+19] Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Rémi Le Priol, Gabriel Huang, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1802 1811. PMLR, 2019.

[GMSR20] Spencer Gordon, Bijan Mazaheri, Leonard J Schulman, and Yuval Rabani. The sparse hausdorff moment problem, with application to topic models. ar Xiv preprint ar Xiv:2007.08101, 2020.

[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.

[GSW+21] Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, and Jieping Ye. A review on generative adversarial networks: Algorithms, theory, and applications. IEEE Transactions on Knowledge and Data Engineering, 2021.

[GVV22] Aparna Gupte, Neekon Vafa, and Vinod Vaikuntanathan. Continuous lwe is as hard as lwe & applications to learning gaussian mixtures. ar Xiv preprint ar Xiv:2204.02550, 2022.

[Har02] Philip Hartman. Ordinary differential equations. SIAM, 2002.

[JMGL22] Samy Jelassi, Arthur Mensch, Gauthier Gidel, and Yuanzhi Li. Adam is no better than normalized SGD: Dissecting how adaptivity improves GAN performance, 2022.

[Kha95] Michael Kharitonov. Cryptographic lower bounds for learnability of boolean functions on the uniform distribution. Journal of Computer and System Sciences, 50(3):600 610, 1995.

[KK14] Adam Klivans and Pravesh Kothari. Embedding hard learning problems into gaussian space. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014.

[KS09] Adam R Klivans and Alexander A Sherstov. Cryptographic hardness for learning intersections of halfspaces. Journal of Computer and System Sciences, 75(1):2 12, 2009.

[KW13] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013.

[LD20] Yuanzhi Li and Zehao Dou. Making method of moments great again? how can gans learn distributions. ar Xiv preprint ar Xiv:2003.04033, 2020.

[LLDD20] Qi Lei, Jason Lee, Alex Dimakis, and Constantinos Daskalakis. SGD learns onelayer networks in wgans. In International Conference on Machine Learning, pages 5799 5808. PMLR, 2020.

[LSSS14] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. Advances in neural information processing systems, 27, 2014.

[RM15] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530 1538. PMLR, 2015.

[SVWX17] Le Song, Santosh Vempala, John Wilmes, and Bo Xie. On the complexity of learning neural networks. Advances in neural information processing systems, 30, 2017.

[SZB21] Min Jae Song, Ilias Zadik, and Joan Bruna. On the cryptographic hardness of learning single periodic neurons. Advances in Neural Information Processing Systems, 34, 2021.

[Val84] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134 1142, 1984.

[Vu98] Van H Vu. On the infeasibility of training neural networks with small mean-squared error. IEEE Transactions on Information Theory, 44(7):2892 2900, 1998.

[VW19] Santosh Vempala and John Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and sq lower bounds. In Conference on Learning Theory, pages 3115 3117. PMLR, 2019.

[WDS19] Shanshan Wu, Alexandros G Dimakis, and Sujay Sanghavi. Learning distributions generated by one-layer relu networks. Advances in neural information processing systems, 32, 2019.

1. For all authors...

(a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] See end of Section 1.1.

(c) Did you discuss any potential negative societal impacts of your work? [N/A] (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results...

(a) Did you state the full set of assumptions of all theoretical results? [Yes] (b) Did you include complete proofs of all theoretical results? [Yes] All proofs are present in either the main body or the forthcoming supplement. 3. If you ran experiments...

(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A] (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A] (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

(a) If your work uses existing assets, did you cite the creators? [N/A] (b) Did you mention the license of the assets? [N/A]

(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]

(d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects...

(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]