Statistical-Query Lower Bounds via Functional Gradients

Surbhi Goel (Microsoft Research, New York, NY, USA; goel.surbhi@microsoft.com)
Aravind Gollakota (University of Texas at Austin, Austin, TX, USA; aravindg@cs.utexas.edu)
Adam Klivans (University of Texas at Austin, Austin, TX, USA; klivans@cs.utexas.edu)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

We give the first statistical-query lower bounds for agnostically learning any nonpolynomial activation with respect to Gaussian marginals (e.g., ReLU, sigmoid, sign). For the specific problem of ReLU regression (equivalently, agnostically learning a ReLU), we show that any statistical-query algorithm with tolerance $n^{-(1/\epsilon)^b}$ must use at least $2^{n^c}\epsilon$ queries for some constants $b, c > 0$, where $n$ is the dimension and $\epsilon$ is the accuracy parameter. Our results rule out general (as opposed to correlational) SQ learning algorithms, which is unusual for real-valued learning problems. Our techniques involve a gradient boosting procedure for amplifying recent lower bounds due to Diakonikolas et al. (COLT 2020) and Goel et al. (ICML 2020) on the SQ dimension of functions computed by two-layer neural networks. The crucial new ingredient is the use of a nonstandard convex functional during the boosting procedure. This also yields a best-possible reduction between two commonly studied models of learning: agnostic learning and probabilistic concepts.

1 Introduction

In this paper we continue a recent line of research exploring the computational complexity of fundamental primitives from the theory of deep learning [GKK19, YS19, DKKZ20, YS20, DGK+20, FCG20]. In particular, we consider the problem of fitting a single nonlinear activation to a joint distribution on $\mathbb{R}^n \times \mathbb{R}$. When the nonlinear activation is the ReLU, this problem is referred to as ReLU regression or agnostically learning a ReLU. When the nonlinear activation is the sign function and the labels are Boolean, this problem is equivalent to the well-studied challenge of agnostically learning a halfspace [KKMS08]. We consider arguably the simplest possible setting, in which the marginal distribution is Gaussian, and give the first statistical-query lower bounds for learning broad classes of nonlinear activations. The statistical-query model is a well-studied framework for analyzing the sample complexity of learning problems and captures most known learning algorithms. For common activations such as ReLU, sigmoid, and sign, we give complementary upper bounds, showing that our results cannot be significantly improved.

Let $\mathcal{H}$ be a function class on $\mathbb{R}^n$, and let $\mathcal{D}$ be a labeled distribution on $\mathbb{R}^n \times \mathbb{R}$ such that the marginal on $\mathbb{R}^n$ is $D = N(0, I_n)$. We say that a learner learns $\mathcal{H}$ under $\mathcal{D}$ with error $\epsilon$ if it outputs a function $f$ such that
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}[f(x)y] \;\ge\; \max_{f^* \in \mathcal{H}} \mathbb{E}_{(x,y)\sim\mathcal{D}}[f^*(x)y] - \epsilon.$$
One can show that this loss captures 0-1 error in the Boolean case, as well as squared loss in the ReLU case whenever the learner is required to output a nontrivial hypothesis (i.e., a hypothesis with norm bounded below by some constant $c > 0$). (See Appendices I and J for details.)
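For intuition, here is the calculation behind the squared-loss claim (a sketch; the full statements are in Appendices I and J): for any hypothesis $f$,
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[(f(x) - y)^2\big] \;=\; \|f\|_D^2 \;-\; 2\,\mathbb{E}_{(x,y)\sim\mathcal{D}}[f(x)\,y] \;+\; \mathbb{E}[y^2],$$
so once $\|f\|_D$ is pinned down (the nontrivial-hypothesis requirement above), minimizing squared loss and maximizing the correlation $\mathbb{E}[f(x)y]$ differ only by terms that do not depend on $f$.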
For ReLU regression, we obtain the following exponential lower bound:

Theorem 1.1. Let $\mathcal{H}_{\mathrm{ReLU}}$ be the class of ReLUs on $\mathbb{R}^n$ with unit weight vectors. Suppose that there is an SQ learner capable of learning $\mathcal{H}_{\mathrm{ReLU}}$ under $\mathcal{D}$ with error $\epsilon$ using $q(n, \epsilon, \tau)$ queries of tolerance $\tau$. Then for any $\epsilon$, there exists $\tau = n^{-(1/\epsilon)^b}$ such that $q(n, \epsilon, \tau) \ge 2^{n^c}\epsilon$ for some $0 < b, c < 1/2$.

That is, a learner must either use tolerance smaller than $n^{-(1/\epsilon)^b}$ or more than $2^{n^c}\epsilon$ queries. Prior work due to Goel et al. [GKK19] gave a quasipolynomial SQ lower bound (with respect to correlational queries) for ReLU regression when the learner is required to output a ReLU as its hypothesis.

For the sigmoid activation we obtain the following lower bound:

Theorem 1.2. Consider the above setup with $\mathcal{H}_\sigma$, the class of unit-weight sigmoid units on $\mathbb{R}^n$. For any $\epsilon$, there exists $\tau = n^{-\Theta(\log^2 1/\epsilon)}$ such that $q(n, \epsilon, \tau) \ge 2^{n^c}\epsilon$ for some $0 < c < 1/2$.

We are not aware of any prior work on the hardness of agnostically learning a sigmoid with respect to Gaussian marginals. For the case of halfspaces, a classic result of Kalai et al. [KKMS08] showed that any halfspace can be agnostically learned with respect to Gaussian marginals in time and sample complexity $n^{O(1/\epsilon^4)}$, which was later improved to $n^{O(1/\epsilon^2)}$ [DKN10]. The only known hardness result for this problem is due to Klivans and Kothari [KK14], who gave a quasipolynomial lower bound based on the hardness of learning sparse parity with noise. Here we give the first exponential lower bound:

Theorem 1.3. Consider the above setup with $\mathcal{H}_{\mathrm{hs}}$, the class of unit-weight halfspaces on $\mathbb{R}^n$. For any $\epsilon$, there exists $\tau = n^{-\Theta(1/\epsilon)}$ such that $q(n, \epsilon, \tau) \ge 2^{n^c}\epsilon$ for some fixed constant $0 < c < 1/2$.

Since it takes $\Theta(1/\tau^2)$ samples to simulate a query of tolerance $\tau$, our constraint on $\tau$ here can be interpreted as saying that to avoid the exponential query lower bound, one needs sample complexity at least $\Theta(1/\tau^2) = n^{\Theta(1/\epsilon)}$, nearly matching the upper bound of [KKMS08, DKN10].

These results are formally stated and proved in Section 5. More generally, we show in Appendix C that our results give superpolynomial SQ lower bounds for agnostically learning any non-polynomial activation. (See Appendix A for some discussion of subtleties in interpreting these bounds.)

A notable property of our lower bounds is that they hold for general statistical queries. As noted by several authors [APVZ14, VW19], proving SQ lower bounds for real-valued learning problems often requires further restrictions on the types of queries the learner is allowed to make (e.g., correlational or Lipschitz queries).

Another consequence of our framework is the first SQ lower bound for agnostically learning monomials with respect to Gaussian marginals. In contrast, for the realizable (noiseless) setting, recent work due to Andoni et al. [ADHV19] gave an attribute-efficient SQ algorithm for learning monomials. They left open the problem of making their results noise-tolerant. We show that in the agnostic setting, no efficient SQ algorithm exists; the proof is in Appendix B.

Theorem 1.4. Consider the above setup with $\mathcal{H}_{\mathrm{mon}}$, the class of multilinear monomials of degree at most $d$ on $\mathbb{R}^n$. For any $\epsilon \ge \exp(-\Theta(d))$ and $\tau \le \epsilon^2$, $q(n, \epsilon, \tau) \ge n^{\Theta(d)}\tau^{5/2}$.

Our Approach. Our approach deviates from the standard template for proving SQ lower bounds and may be of independent interest. In almost all prior work, SQ lower bounds are derived by constructing a sufficiently large family of nearly orthogonal functions with respect to the underlying marginal distribution. Instead, we will use a reduction-based approach: we show that an algorithm for agnostically learning a single nonlinear activation $\phi$ can be used as a subroutine for learning depth-two neural networks of the form $\psi\big(\sum_i a_i \phi(w_i \cdot x)\big)$, where $\psi$ is any monotone, Lipschitz activation.
This reduction involves an application of functional gradient descent via the Frank-Wolfe method with respect to a (nonstandard) convex surrogate loss. We apply recent work due to [DKKZ20] and [GGJ+20] that gives SQ lower bounds for learning depth-two neural networks of the above form in the probabilistic concept model. For technical reasons, our lower bound depends on the norms of these depth-two networks, and we explicitly calculate them for ReLU and sigmoid. We prove that the above reduction can be performed using only statistical queries. To do so, we make use of some subtle properties of the surrogate loss and the functional gradient method itself.

Our reduction implies the following new relationship between two well-studied models of learning: if a concept class $\mathcal{C}$ is efficiently agnostically learnable, then the class of monotone, Lipschitz functions of linear combinations of $\mathcal{C}$ is learnable in the probabilistic concept model due to Kearns and Schapire [KS94]. We cannot hope to further strengthen the conclusion to agnostic learnability of monotone, Lipschitz functions of combinations of $\mathcal{C}$: the concept class of literals is agnostically learnable, but we show exponential SQ lower bounds for agnostically learning the class of majorities of literals, i.e., halfspaces (see also [KK14]).

Related Work. Several recent papers have considered the computational complexity of learning simple neural networks [Bac17, GKKT17, YS20, FCG20, KK14, LSSS14, SVWX17, VW19, GKK19, GGJ+20, DKKZ20]. The above works either consider one-layer neural networks (as opposed to learning single neurons), or make use of discrete distributions (rather than Gaussian marginals), or hold for narrower classes of algorithms (rather than SQ algorithms). Goel et al. [GKK19] give a quasipolynomial correlational SQ lower bound for proper agnostic learning of ReLUs with respect to Gaussian marginals. They additionally give a similar computational lower bound assuming the hardness of learning sparse parity with noise.

The idea of using functional gradient descent to learn one-hidden-layer neural networks appears in work due to Bach [Bac17], who considered an incremental conditional gradient algorithm that at each iteration implicitly requires an agnostic learner to complete a Frank-Wolfe step. A key idea in our work is to optimize with respect to a particular convex functional (surrogate loss) in order to obtain SQ learnability of depth-two neural networks with a nonlinear output activation. We can then leverage SQ lower bounds for this broader class of neural networks.

Functional gradient descent and gradient boosting methods have been used frequently in learning theory, especially in online learning (see, e.g., [Fri01, MBBF00, SF12, BHKL15, Haz16]). For Boolean functions, the idea of using boosting to learn majorities of a base class appeared in Jackson [Jac97], who boosted a weak parity learning algorithm in order to learn thresholds of parities (TOP). Agnostic, distribution-specific boosting algorithms for Boolean functions have appeared in works due to Kalai and Kanade [KK09] and also Feldman [Fel10]. Agnostic boosting in the context of the SQ model is explored in [Fel12], where an SQ lower bound is given for agnostically learning monotone conjunctions with respect to the uniform distribution on the Boolean hypercube.
The SQ lower bounds we obtain for agnostically learning halfspaces can be derived using one of the above boosting algorithms due to Kalai and Kanade [KK09] or Feldman [Fel10] in place of functional gradient descent, as halfspaces are Boolean functions.

Independent Work. Independently and concurrently, Diakonikolas et al. [DKZ20] have obtained similar results for agnostically learning halfspaces and ReLUs. Rather than using a reduction-based approach, they construct a hard family of Boolean functions. They show that an agnostic learner for halfspaces or ReLUs would yield a learner for this family, which would solve a hard unsupervised distribution-learning problem considered in [DKS17]. Quantitatively, the lower bound they obtain is that agnostic learning of halfspaces or ReLUs up to excess error $\epsilon$ using queries of tolerance $n^{-\mathrm{poly}(1/\epsilon)}$ requires at least $n^{\mathrm{poly}(1/\epsilon)}$ queries. These results are technically incomparable with ours. For queries of similar tolerance, our bound of $2^{n^c}\epsilon$ scales exponentially with $n$ whereas theirs only scales polynomially, so that for any constant $\epsilon$ our bound is exponentially stronger. But our bound does not scale directly with $1/\epsilon$ (other than via the induced constraint on the tolerance, which does scale as $n^{-\mathrm{poly}(1/\epsilon)}$). Our work also extends to general non-polynomial activations, while theirs does not.

Organization. We cover the essential definitions, models, and existing lower bounds that we need in the preliminaries. Our main reduction, which says that if we could agnostically learn a single neuron then we could learn depth-two neural networks composed of such neurons, is set up as follows. In Section 3 we explain our usage of functional gradient descent, with Assumption 3.1 formally stating the kind of agnostic learning guarantee we require for a single neuron. The main reduction itself is Theorem 4.1, the subject of Section 4. In Section 5 we derive the formal lower bounds which follow as a consequence of our reduction. Finally, in Section 6, we contrast these lower bounds with some simple upper bounds. The more technical proofs may be found in the appendices.

2 Preliminaries

Notation. Let $D$ be a distribution over $\mathbb{R}^n$, which for us will be the standard Gaussian $N(0, I_n)$ throughout. We will work with the $L^2$ space $L^2(\mathbb{R}^n, D)$ of functions from $\mathbb{R}^n$ to $\mathbb{R}$, with the inner product given by $\langle f, g\rangle_D = \mathbb{E}_D[fg]$. The corresponding norm is $\|f\|_D = \sqrt{\mathbb{E}_D[f^2]}$. We refer to the ball of radius $R$ as $B_D(R) = \{f \in L^2(\mathbb{R}^n, D) \mid \|f\|_D \le R\}$. We will omit the subscripts when the meaning is clear from context. Given vectors $u, v \in \mathbb{R}^n$, we will refer to their Euclidean dot product by $u \cdot v$ and the Euclidean norm by $\|u\|_2$. Given a function $\ell(a, b)$, we denote its partial derivative with respect to its first parameter, $\frac{\partial\ell}{\partial a}(a, b)$, by $\partial_1\ell(a, b)$.

A Boolean probabilistic concept, or p-concept, is a function that maps each point $x$ to a random $\{\pm 1\}$-valued label $y$ in such a way that $\mathbb{E}[y \mid x] = f^*(x)$ for a fixed function $f^*: \mathbb{R}^n \to [-1, 1]$, known as its conditional mean function. We will use $D_{f^*}$ to refer to the (unique) induced labeled distribution on $\mathbb{R}^n \times \{\pm 1\}$, i.e. we say $(x, y) \sim D_{f^*}$ if the marginal distribution of $x$ is $D$ and $\mathbb{E}[y \mid x] = f^*(x)$. We also sometimes write $y \sim f^*(x)$ to say that $y \in \{\pm 1\}$ and $\mathbb{E}[y \mid x] = f^*(x)$.
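To make the p-concept model concrete, here is a minimal sketch of drawing labeled examples from $D_{f^*}$; the particular conditional-mean function below is a hypothetical choice made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # ambient dimension (illustrative choice)

def f_star(x, w):
    """A hypothetical conditional-mean function f*: R^n -> [-1, 1];
    tanh of a linear form is used purely for illustration."""
    return np.tanh(x @ w)

def sample_D_fstar(m, w):
    """Draw m labeled examples (x, y) from D_{f*}: the marginal of x is
    N(0, I_n) and y is {-1, +1}-valued with E[y | x] = f*(x)."""
    x = rng.standard_normal((m, n))
    p_plus = (1.0 + f_star(x, w)) / 2.0   # P[y = +1 | x]
    y = np.where(rng.random(m) < p_plus, 1.0, -1.0)
    return x, y

w = rng.standard_normal(n)
w /= np.linalg.norm(w)                     # unit weight vector
x, y = sample_D_fstar(5, w)
```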
Statistical Query (SQ) Model. A statistical query is specified by a query function $\phi: \mathbb{R}^n \times \mathbb{R} \to [-1, 1]$. Given a labeled distribution $\mathcal{D}$ on $\mathbb{R}^n \times \mathbb{R}$, the SQ model allows access to an SQ oracle (known as the STAT oracle in the SQ literature) that accepts a query $\phi$ of specified tolerance $\tau$, and responds with a value in $\big[\mathbb{E}_{(x,y)\sim\mathcal{D}}[\phi(x, y)] - \tau,\ \mathbb{E}_{(x,y)\sim\mathcal{D}}[\phi(x, y)] + \tau\big]$.

Let $\mathcal{C}$ be a class of Boolean p-concepts over $\mathbb{R}^n$, and let $D$ be a distribution on $\mathbb{R}^n$. We say that a learner learns $\mathcal{C}$ with respect to $D$ up to $L^2$ error $\epsilon$ if, given only SQ oracle access to $D_{f^*}$ for some unknown $f^* \in \mathcal{C}$, and using arbitrary queries, it is able to output $f: \mathbb{R}^n \to [-1, 1]$ such that $\|f - f^*\|_D \le \epsilon$. It is worth emphasizing that a query to $D_{f^*}$ takes in a Boolean rather than a real-valued label, i.e. is really of the form $\phi: \mathbb{R}^n \times \{\pm 1\} \to [-1, 1]$. In contrast, a query to a generic distribution $\mathcal{D}$ on $\mathbb{R}^n \times \mathbb{R}$ takes in real-valued labels, and in Assumption 3.1 we define a form of learning that operates in this more generic setting.

One of the chief features of the SQ model is that one can give strong information-theoretic lower bounds on learning a class $\mathcal{C}$ in terms of its so-called statistical dimension, which can be thought of as roughly measuring how many highly orthogonal functions $\mathcal{C}$ contains.

Definition 2.1. Let $D$ be a distribution on $\mathbb{R}^n$, and let $\mathcal{C}$ be a real-valued or Boolean concept class on $\mathbb{R}^n$. The average (un-normalized) correlation of $\mathcal{C}$ is defined to be $\rho_D(\mathcal{C}) = \frac{1}{|\mathcal{C}|^2}\sum_{c, c' \in \mathcal{C}} |\langle c, c'\rangle_D|$. The statistical dimension on average at threshold $\gamma$, $\mathrm{SDA}_D(\mathcal{C}, \gamma)$, is the largest $d$ such that for all $\mathcal{C}' \subseteq \mathcal{C}$ with $|\mathcal{C}'| \ge |\mathcal{C}|/d$, we have $\rho_D(\mathcal{C}') \le \gamma$.

In the p-concept setting, lower bounds against general queries in terms of SDA were first formally shown in [GGJ+20].

Theorem 2.2 ([GGJ+20], Cor. 4.6). Let $D$ be a distribution on $\mathbb{R}^n$, and let $\mathcal{C}$ be a p-concept class on $\mathbb{R}^n$. Say our queries are of tolerance $\tau$, the final desired $L^2$ error is $\epsilon$, and the functions in $\mathcal{C}$ satisfy $\|f^*\| \ge \beta$ for all $f^* \in \mathcal{C}$. For technical reasons, we will require $\tau \le \epsilon^2$ and $\epsilon \le \beta/3$ (see Appendix A for some discussion). Then learning $\mathcal{C}$ up to $L^2$ error $\epsilon$ (we may pick $\epsilon$ as large as $\beta/3$) requires at least $\mathrm{SDA}_D(\mathcal{C}, \tau^2)$ queries of tolerance $\tau$.

A recent result of Diakonikolas et al. [DKKZ20] gave the following construction of one-layer neural networks on $\mathbb{R}^n$ with $k$ hidden units, i.e. functions of the form $g(x) = \psi\big(\sum_{i=1}^k a_i\,\phi(x \cdot w_i)\big)$ for activation functions $\psi, \phi: \mathbb{R} \to \mathbb{R}$ and weights $w_i \in \mathbb{R}^n$, $a_i \in \mathbb{R}$.

Theorem 2.3 ([DKKZ20]). There exists a class $\mathcal{G}$ of one-layer neural networks on $\mathbb{R}^n$ with $k$ hidden units such that for some universal constant $0 < c < 1/2$ and $\gamma = n^{\Theta(k(c - 1/2))}$, $\mathrm{SDA}(\mathcal{G}, \gamma) \ge 2^{n^c}$. This holds for any $\psi: \mathbb{R} \to [-1, 1]$ that is odd, and any $\phi \in L^2(\mathbb{R}, N(0, 1))$ that has a nonzero Hermite coefficient of degree greater than $k/2$. Further, the weights satisfy $|a_i| = 1/k$ and $\|w_i\|_2 = 1$ for all $i$.

We will be interested in the following special cases. Proofs of the norm lower bounds are in Appendix D.

Corollary 2.4. For the following instantiations of $\mathcal{G}$, with accompanying norm lower bound $\beta$ (i.e. such that $\|g\| \ge \beta$ for all $g \in \mathcal{G}$), there exist $\tau = n^{-\Theta(k)}$ and $\epsilon \ge \tau$ such that learning $\mathcal{G}$ up to $L^2$ error $\epsilon$ requires at least $2^{n^c}$ queries of tolerance $\tau$, for some $0 < c < 1/2$.
(a) ReLU nets: $\psi = \tanh$, $\phi = \mathrm{ReLU}$. Then $\beta = \Omega(1/k^6)$ (Lemma D.4), so we may take $\epsilon = \Theta(1/k^6)$.
(b) Sigmoid nets: $\psi = \tanh$, $\phi = \sigma$. Then $\beta = \exp(-O(\sqrt{k}))$ (Lemma D.6), so we may take $\epsilon = \exp(-\Theta(\sqrt{k}))$.
(c) Majority of halfspaces: $\psi = \phi = \mathrm{sign}$. Being Boolean functions, here $\beta = 1$ exactly, so we may take $\epsilon = \Theta(1)$.
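To illustrate the role of the tolerance parameter, and the standard fact (used in the introduction) that a tolerance-$\tau$ query can be simulated from on the order of $1/\tau^2$ i.i.d. samples, here is a minimal sketch; the constant in the sample count is a Hoeffding-style heuristic rather than a formal bound:

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 10, 0.05

def simulate_stat_query(phi, x, y, tau):
    """Answer a query phi: R^n x {-1,+1} -> [-1, 1] within tolerance roughly tau
    by averaging over about 1/tau^2 samples (the constant 4 is illustrative)."""
    m = int(np.ceil(4.0 / tau**2))
    if len(y) < m:
        raise ValueError("not enough samples for tolerance tau")
    return float(np.mean(phi(x[:m], y[:m])))

# Toy p-concept with conditional mean tanh(w . x) (a hypothetical choice).
w = rng.standard_normal(n); w /= np.linalg.norm(w)
x = rng.standard_normal((200_000, n))
y = np.where(rng.random(len(x)) < (1.0 + np.tanh(x @ w)) / 2.0, 1.0, -1.0)

# Example query: the (clipped) correlation of a candidate ReLU with the labels.
phi = lambda xs, ys: np.clip(np.maximum(xs @ w, 0.0) * ys, -1.0, 1.0)
print(simulate_stat_query(phi, x, y, tau))
```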
Convex Optimization Basics. Over a general inner product space $\mathcal{Z}$, a function $p: \mathcal{Z} \to \mathbb{R}$ is convex if for all $\alpha \in [0, 1]$ and $z, z' \in \mathcal{Z}$, $p(\alpha z + (1 - \alpha)z') \le \alpha p(z) + (1 - \alpha)p(z')$. We say that $s \in \mathcal{Z}$ is a subgradient of $p$ at $z$ if $p(z + h) - p(z) \ge \langle s, h\rangle$ for all $h$. We say that $p$ is $\beta$-smoothly convex if for all $z, h \in \mathcal{Z}$ and any subgradient $s$ of $p$ at $z$,
$$p(z + h) - p(z) \;\le\; \langle s, h\rangle + \frac{\beta}{2}\|h\|^2.$$
If there is a unique subgradient of $p$ at $z$, we simply refer to it as the gradient $\nabla p(z)$. It is easily proven that smoothly convex functions have unique subgradients at all points. Another standard property is the following: for any $z, z' \in \mathcal{Z}$,
$$p(z) - p(z') \;\le\; \langle \nabla p(z), z - z'\rangle - \frac{1}{2\beta}\|\nabla p(z) - \nabla p(z')\|^2. \quad (1)$$

In this paper we will be concerned with convex optimization using the Frank-Wolfe variant of gradient descent, also known as conditional gradient descent. In order to eventually apply this framework to improper learning, we will consider a slight generalization of the standard setup. Let $Z \subseteq \bar{Z}$ both be compact, convex subsets of our generic inner product space. Say we have a $\beta$-smoothly convex function $p: \bar{Z} \to \mathbb{R}$, and we want to solve $\min_{z \in Z} p(z)$, i.e. optimize over the smaller domain, while allowing ourselves the freedom of finding subgradients that lie in the larger $\bar{Z}$. The Frank-Wolfe algorithm in this improper setting is Algorithm 1.

Algorithm 1 (Frank-Wolfe gradient descent over a generic inner product space):
  Start with an arbitrary $z_0 \in \bar{Z}$.
  for $t = 0, \ldots, T$ do
    Let $\gamma_t = \frac{2}{t+2}$.
    Find $s \in \bar{Z}$ such that $\langle s, -\nabla p(z_t)\rangle \ge \max_{s' \in Z} \langle s', -\nabla p(z_t)\rangle - \frac{1}{2}\delta\gamma_t C_p$.
    Let $z_{t+1} = (1 - \gamma_t)z_t + \gamma_t s$.
  end for

The following theorem holds by standard analysis (see, e.g., [Jag13]). For convenience, we provide a self-contained proof in Appendix G.

Theorem 2.5. Let $Z \subseteq \bar{Z}$ be convex sets, and let $p: \bar{Z} \to \mathbb{R}$ be a $\beta$-smoothly convex function. Let $C_p = \beta\,\mathrm{diam}(\bar{Z})^2$. For every $t$, the iterates of Algorithm 1 satisfy
$$p(z_t) - \min_{z' \in Z} p(z') \;\le\; \frac{2C_p}{t + 2}(1 + \delta).$$

3 Functional gradient descent

Let $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a loss function. Given a p-concept $f^*$ and its corresponding labeled distribution $D_{f^*}$, the population loss of a function $f: \mathbb{R}^n \to \mathbb{R}$ is given by $L(f) = \mathbb{E}_{(x,y)\sim D_{f^*}}[\ell(f(x), y)]$. We will view $L$ as a mapping from $L^2(\mathbb{R}^n, D)$ to $\mathbb{R}$, and refer to it as the loss functional. The general idea of functional gradient descent is to try to find an $f$ in a class of functions $\mathcal{F}$ that minimizes $L(f)$ by performing gradient descent in function space. When using Frank-Wolfe gradient descent, the key step in every iteration is to find the vector that has the greatest projection along the negative gradient, which amounts to solving a linear optimization problem over the domain. When $\mathcal{F}$ is the convex hull $\mathrm{conv}(\mathcal{H})$ of a simpler class $\mathcal{H}$, this can be done using a sufficiently powerful agnostic learning primitive for $\mathcal{H}$. Thus we can boost such a primitive in a black-box manner to minimize $L(f)$.

One reason for using Frank-Wolfe rather than, say, standard projected GD is that in the $L^2$ function space in which we operate, it is not natural to require the learner to find a projection of the functional gradient onto $\mathrm{conv}(\mathcal{H})$. Another important reason is that Frank-Wolfe uses a linear optimization subproblem in each step, and this is important in order to carry out the entire reduction purely using statistical queries.

Let $\mathcal{H} \subseteq L^2(\mathbb{R}^n, D)$ be a base hypothesis class for which we have an agnostic learner with the following guarantee:

Assumption 3.1. There is an SQ learner for $\mathcal{H}$ with the following guarantee. Let $\mathcal{D}$ be any labeled distribution on $\mathbb{R}^n \times \mathbb{R}$ such that the marginal on $\mathbb{R}^n$ is $D = N(0, I_n)$. Given only SQ access to $\mathcal{D}$, the learner outputs a function $f \in B(\mathrm{diam}(\mathcal{H})/2)$ such that
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}[f(x)y] \;\ge\; \max_{f^* \in \mathcal{H}} \mathbb{E}_{(x,y)\sim\mathcal{D}}[f^*(x)y] - \epsilon$$
using $q(n, \epsilon, \tau)$ queries of tolerance $\tau$.
Notice that we do not require $f$ to lie in $\mathcal{H}$, i.e. the learner is allowed to be improper, but we do require it to have norm at most $\mathrm{diam}(\mathcal{H})/2$. This is to make the competitive guarantee against $\mathcal{H}$ meaningful, since otherwise the correlation can be made to scale arbitrarily with the norm.

With such an $\mathcal{H}$ in place, we define $\mathcal{F} = \mathrm{conv}(\mathcal{H})$. We assume that $f^* \in \mathcal{F}$. Our objective will be to agnostically learn $\mathcal{F}$: to solve $\min_{f \in \mathcal{F}} L(f)$, in the sense of finding $f$ such that $L(f) - L(f^*) \le \epsilon$. To be able to use Frank-Wolfe, we require some assumptions on the loss function $\ell$.

Assumption 3.2. The loss function $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is $\beta$-smoothly convex in its first parameter.

As we show in Appendix H, from this assumption it easily follows that the loss functional $L$ is itself $\beta$-smoothly convex, and that the gradient of $L$ at $f$ is given by $\nabla L(f): x \mapsto \mathbb{E}_{y \sim f^*(x)}[\partial_1\ell(f(x), y)]$.

Example 3.3. The canonical example is the squared loss functional, with $\ell_{\mathrm{sq}}(a, b) = (a - b)^2$, which is 2-smoothly convex. Here the gradient has a very simple form, since $\partial_1\ell_{\mathrm{sq}}(a, b) = 2(a - b)$, and so
$$\mathbb{E}_{y \sim f^*(x)}[\partial_1\ell_{\mathrm{sq}}(f(x), y)] = \mathbb{E}_{y \sim f^*(x)}[2(f(x) - y)] = 2(f(x) - f^*(x)),$$
i.e. $\nabla L_{\mathrm{sq}}(f) = 2(f - f^*)$. In fact, it is easily calculated that $L_{\mathrm{sq}}(f) = \mathbb{E}_{(x,y)\sim D_{f^*}}[(f(x) - y)^2] = \|f\|^2 - 2\langle f, f^*\rangle + 1$. It is also useful to note that
$$L_{\mathrm{sq}}(f) - L_{\mathrm{sq}}(f^*) = \|f - f^*\|^2. \quad (2)$$

Frank-Wolfe using statistical queries. We see that our loss functional is a $\beta$-smoothly convex functional on the space $L^2(\mathbb{R}^n, D)$. We can now use Frank-Wolfe if we can solve its main subproblem: finding an approximate solution to $\max_{h \in \mathcal{F}} \langle h, -\nabla L(f)\rangle$, where $f$ is the current hypothesis during some iteration. Since this is a linear optimization objective and $\mathcal{F} = \mathrm{conv}(\mathcal{H})$, this is the same as solving $\max_{h \in \mathcal{H}} \langle h, -\nabla L(f)\rangle$. This is almost the guarantee that Assumption 3.1 gives us, but some care is in order. What we have SQ access to is the labeled distribution $D_{f^*}$ on $\mathbb{R}^n \times \{\pm 1\}$. It is not clear that we can rewrite the optimization objective in such a way that
$$\max_{h \in \mathcal{H}} \mathbb{E}_{x \sim D}[-h(x)\nabla L(f)(x)] \;=\; \max_{h \in \mathcal{H}} \mathbb{E}_{(x, y')\sim\mathcal{D}'}[h(x)y'] \quad (3)$$
for some distribution $\mathcal{D}'$ on $\mathbb{R}^n \times \mathbb{R}$ that we can simulate SQ access to. Naively, we might try to do this by letting $\mathcal{D}'$ be the distribution of $(x, -\nabla L(f)(x))$ for $x \sim D$, so that a query $\phi$ to $\mathcal{D}'$ can be answered with $\mathbb{E}_{(x,y')\sim\mathcal{D}'}[\phi(x, y')] = \mathbb{E}_{x \sim D}[\phi(x, -\nabla L(f)(x))]$. But the issue is that in general $\nabla L(f)(x)$ will depend on $f^*(x)$, which we do not know; all we have access to is $D_{f^*}$. It turns out that for the loss functions we are interested in, we can indeed find a suitable such $\mathcal{D}'$. We turn to the details now.

4 Functional gradient descent guarantees on surrogate loss

The functional GD approach applied directly to squared loss would allow us to learn $\mathcal{F} = \mathrm{conv}(\mathcal{H})$ using a learner for $\mathcal{H}$ (that satisfied Assumption 3.1). But by considering a certain surrogate loss, we can use the same learner to actually learn $\psi \circ \mathcal{F} = \{\psi \circ f \mid f \in \mathcal{F}\}$ for an outer activation function $\psi$. This is particularly useful as we can now capture p-concepts corresponding to functions in $\mathcal{F}$ by using a suitable $\psi: \mathbb{R} \to [-1, 1]$. For example, the common softmax activation corresponds to taking $\psi = \tanh$.

Assume that $\mathbb{E}[y \mid x] = \psi(f^*(x))$ for some activation $\psi: \mathbb{R} \to \mathbb{R}$ which is non-decreasing and $\lambda$-Lipschitz. Instead of the squared loss, we will consider the following surrogate loss:
$$\ell_{\mathrm{sur}}(a, b) = \int_0^a (\psi(u) - b)\,du.$$
One can show that $\ell_{\mathrm{sur}}$ is in fact $\lambda$-smoothly convex, and that $\partial_1\ell_{\mathrm{sur}}(a, b) = \psi(a) - b$. The gradient of the surrogate loss functional, $L_{\mathrm{sur}}(f) = \mathbb{E}_{(x,y)\sim D_{\psi\circ f^*}}[\ell_{\mathrm{sur}}(f(x), y)]$, is given by
$$\nabla L_{\mathrm{sur}}(f): x \mapsto \mathbb{E}_{y \sim \psi(f^*(x))}[\partial_1\ell_{\mathrm{sur}}(f(x), y)] = \psi(f(x)) - \psi(f^*(x)),$$
i.e. $\nabla L_{\mathrm{sur}}(f) = \psi \circ f - \psi \circ f^*$.
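For concreteness, consider the outer activation $\psi = \tanh$ used in Section 5, which is 1-Lipschitz (so $\lambda = 1$). A quick calculation (ours, as a sanity check of the definitions above) gives
$$\ell_{\mathrm{sur}}(a, b) = \int_0^a (\tanh(u) - b)\,du = \log\cosh(a) - ab, \qquad \partial_1\ell_{\mathrm{sur}}(a, b) = \tanh(a) - b,$$
and since $\partial_1^2\ell_{\mathrm{sur}}(a, b) = 1 - \tanh^2(a) \in (0, 1]$, the surrogate loss is indeed convex and 1-smoothly convex in its first argument.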
We still need to show that the Frank-Wolfe subproblem can be solved using access to just $D_{\psi\circ f^*}$. Observe that
$$\mathbb{E}_{x \sim D}[-h(x)\nabla L_{\mathrm{sur}}(f)(x)] = \mathbb{E}_{x \sim D}[h(x)(\psi(f^*(x)) - \psi(f(x)))] = \mathbb{E}_{x \sim D}\big[h(x)\big(\mathbb{E}_{y \sim \psi(f^*(x))}[y] - \psi(f(x))\big)\big] = \mathbb{E}_{(x,y)\sim D_{\psi\circ f^*}}[h(x)(y - \psi(f(x)))] = \mathbb{E}_{(x,y')\sim\mathcal{D}'}[h(x)y'],$$
where $\mathcal{D}'$ is the distribution of $(x, y - \psi(f(x)))$ for $(x, y) \sim D_{\psi\circ f^*}$. We can easily simulate SQ access to this using $D_{\psi\circ f^*}$: if $\phi$ is any query to $\mathcal{D}'$, then
$$\mathbb{E}_{(x,y')\sim\mathcal{D}'}[\phi(x, y')] = \mathbb{E}_{(x,y)\sim D_{\psi\circ f^*}}[\phi(x, y - \psi(f(x)))] = \mathbb{E}_{(x,y)\sim D_{\psi\circ f^*}}[\phi'(x, y)] \quad (4)$$
for the modified query $\phi'(x, y) = \phi(x, y - \psi(f(x)))$. This means we can rewrite the optimization objective to fit the form in Eq. (3). Thus for our surrogate loss, Assumption 3.1 allows us to solve the Frank-Wolfe subproblem, giving us Algorithm 2 for learning $\mathcal{F}$.

Algorithm 2 (Frank-Wolfe for solving $\min_{f \in \mathcal{F}} L_{\mathrm{sur}}(f)$):
  Start with an arbitrary $f_0 \in B(\mathrm{diam}(\mathcal{H})/2)$.
  for $t = 0, \ldots, T$ do
    Let $\gamma_t = \frac{2}{t+2}$.
    Let $\mathcal{D}_t$ be the distribution of $(x, y - \psi(f_t(x)))$ for $(x, y) \sim D_{\psi\circ f^*}$.
    Using Assumption 3.1, find $h \in B(\mathrm{diam}(\mathcal{H})/2)$ such that $\mathbb{E}_{(x,y')\sim\mathcal{D}_t}[h(x)y'] \ge \max_{h' \in \mathcal{H}} \mathbb{E}_{(x,y')\sim\mathcal{D}_t}[h'(x)y'] - \frac{1}{2}\gamma_t\lambda\,\mathrm{diam}(\mathcal{H})^2$.
    Let $f_{t+1} = (1 - \gamma_t)f_t + \gamma_t h$.
  end for

Theorem 4.1. Let $\mathcal{H}$ be a class for which Assumption 3.1 holds, and let $\mathcal{F} = \mathrm{conv}(\mathcal{H})$. Given SQ access to $D_{\psi\circ f^*}$ for a known non-decreasing $\lambda$-Lipschitz activation $\psi$ and an unknown $f^* \in \mathcal{F}$, suppose we wish to learn $\psi \circ f^*$ in terms of surrogate loss, i.e. to minimize $L_{\mathrm{sur}}(f)$. Then after $T$ iterations of Algorithm 2, we have the following guarantee:
$$L_{\mathrm{sur}}(f_T) - L_{\mathrm{sur}}(f^*) \;\le\; \frac{4\lambda\,\mathrm{diam}(\mathcal{H})^2}{T + 2}.$$
In particular, we can achieve $L_{\mathrm{sur}}(f_T) - L_{\mathrm{sur}}(f^*) \le \epsilon$ after $T = O(\lambda\,\mathrm{diam}(\mathcal{H})^2/\epsilon)$ iterations. Assuming our queries are of tolerance $\tau$, the total number of queries used is at most $T \cdot q(n, \epsilon/4, \tau) = O\big(\frac{\lambda\,\mathrm{diam}(\mathcal{H})^2}{\epsilon}\, q(n, \epsilon/4, \tau)\big)$.

Proof. By the preceding discussion, the surrogate loss functional is $\lambda$-smoothly convex, and Algorithm 2 is a valid special case of Algorithm 1, with $\bar{Z} = B(\mathrm{diam}(\mathcal{H})/2)$ and $Z = \mathcal{F} = \mathrm{conv}(\mathcal{H})$. Thus the guarantee follows directly from Theorem 2.5 (setting $\delta = 1$). To bound the number of queries, observe that it is sufficient to run for $T = 4\lambda\,\mathrm{diam}(\mathcal{H})^2/\epsilon - 2$ rounds. In the $t$-th iteration, we invoke Assumption 3.1 with accuracy parameter
$$\frac{1}{2}\gamma_t\lambda\,\mathrm{diam}(\mathcal{H})^2 = \frac{\lambda\,\mathrm{diam}(\mathcal{H})^2}{t + 2} \;\ge\; \frac{\lambda\,\mathrm{diam}(\mathcal{H})^2}{T + 2} \;\ge\; \frac{\epsilon}{4}.$$
Since $q(n, \epsilon', \tau) \le q(n, \epsilon/4, \tau)$ for $\epsilon' \ge \epsilon/4$, the bound follows.

Lastly, we can show that minimizing surrogate loss also minimizes the squared loss. Observe first that $\nabla L_{\mathrm{sur}}(f^*) = 0$. Thus, applying Eq. (1) with $z = f^*$ and $z' = f$, we get
$$L_{\mathrm{sur}}(f) - L_{\mathrm{sur}}(f^*) \;\ge\; \frac{1}{2\lambda}\|\nabla L_{\mathrm{sur}}(f) - \nabla L_{\mathrm{sur}}(f^*)\|^2 = \frac{1}{2\lambda}\|\psi\circ f - \psi\circ f^*\|^2 \quad (5)$$
$$= \frac{1}{2\lambda}\big(L_{\mathrm{sq}}(\psi\circ f) - L_{\mathrm{sq}}(\psi\circ f^*)\big),$$
where $L_{\mathrm{sq}}$ is the squared loss with respect to $D_{\psi\circ f^*}$ and the last equality is Eq. (2). In particular, Eq. (5) implies that $\psi\circ f$ achieves the following $L^2$ error with respect to $\psi\circ f^*$:
$$\|\psi\circ f - \psi\circ f^*\| \;\le\; \sqrt{2\lambda\,(L_{\mathrm{sur}}(f) - L_{\mathrm{sur}}(f^*))}. \quad (6)$$
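To make the loop concrete, here is a minimal sample-based sketch of Algorithm 2 (not an SQ implementation; the stand-in learner below is a hypothetical placeholder we supply purely for illustration, since the point of Section 5 is that no efficient SQ learner actually satisfies Assumption 3.1 for these classes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, T = 10, 50_000, 30

relu = lambda z: np.maximum(z, 0.0)
psi = np.tanh                          # outer activation; 1-Lipschitz, so lambda = 1

# An unknown f* in conv(H): a convex combination of signed unit-weight ReLUs.
W_star = rng.standard_normal((4, n))
W_star /= np.linalg.norm(W_star, axis=1, keepdims=True)
a_star = np.array([0.25, -0.25, 0.25, -0.25])
f_star = lambda z: relu(z @ W_star.T) @ a_star

# Samples from D_{psi o f*}: y in {-1, +1} with E[y | x] = psi(f*(x)).
x = rng.standard_normal((m, n))
y = np.where(rng.random(m) < (1.0 + psi(f_star(x))) / 2.0, 1.0, -1.0)

def stand_in_learner(x, y_resid):
    """Placeholder for the agnostic learner of Assumption 3.1: return the signed
    random unit-weight ReLU with the largest empirical correlation with the
    residual labels. This only illustrates the structure of the loop."""
    V = rng.standard_normal((64, n))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    corr = relu(x @ V.T).T @ y_resid / len(y_resid)
    j = int(np.argmax(np.abs(corr)))
    s = float(np.sign(corr[j]))
    return lambda z: s * relu(z @ V[j])

# Algorithm 2: Frank-Wolfe on the surrogate loss, run entirely on relabeled data.
f_t = lambda z: np.zeros(len(z))
for t in range(T):
    gamma = 2.0 / (t + 2)
    y_resid = y - psi(f_t(x))          # labels of the relabeled distribution D_t
    h = stand_in_learner(x, y_resid)
    f_t = (lambda f_prev, h_new, g: lambda z: (1 - g) * f_prev(z) + g * h_new(z))(f_t, h, gamma)

err = np.sqrt(np.mean((psi(f_t(x)) - psi(f_star(x))) ** 2))
print("empirical L2 error of psi(f_T) against psi(f*):", float(err))
```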
5 Lower bounds on learning ReLUs, sigmoids, and halfspaces

The machinery so far has shown that if we could agnostically learn a single unit (e.g. a ReLU or a sigmoid), we could learn depth-two neural networks composed of such units. Since we have lower bounds on the latter problem, this yields the following lower bounds on the former.

Theorem 5.1. Let $\mathcal{H}_{\mathrm{ReLU}} = \{x \mapsto \pm\mathrm{ReLU}(w \cdot x) \mid \|w\|_2 = 1\}$ be the class of ReLUs on $\mathbb{R}^n$ with unit weight vectors.¹ Suppose that Assumption 3.1 holds for $\mathcal{H}_{\mathrm{ReLU}}$. Then for any $\epsilon$, there exists $\tau = n^{-\Theta(\epsilon^{-1/12})}$ such that $q(n, \epsilon, \tau) \ge 2^{n^c}\epsilon$ for some $0 < c < 1/2$.

Proof. Since all our lower bound proofs are similar, to set a template we lay out all the steps as clearly as possible. Consider the class $\mathcal{G}$ from Theorem 2.3 instantiated with $\psi = \tanh$ (which is 1-Lipschitz, so $\lambda = 1$) and $\phi = \mathrm{ReLU}$. By the conditions on the weights, we see that $\mathcal{G} \subseteq \tanh \circ \mathcal{F}_{\mathrm{ReLU}}$, where $\mathcal{F}_{\mathrm{ReLU}} = \mathrm{conv}(\mathcal{H}_{\mathrm{ReLU}})$. This construction has a free parameter $k$, which we will set based on $\epsilon$.

By our main reduction (Assumption 3.1 and Theorem 4.1), we can learn $\tanh \circ \mathcal{F}_{\mathrm{ReLU}}$ with respect to $L_{\mathrm{sur}}$ up to agnostic error $\epsilon$ using $O\big(\frac{1}{\epsilon}\, q(n, \epsilon/4, \tau)\big)$ queries of tolerance $\tau$. By Eq. (6), this implies learning $\mathcal{G}$ up to $L^2$ error $\sqrt{2\epsilon}$. We know that learning $\mathcal{G}$ should be hard. Specifically, Corollary 2.4(a) states that if $\epsilon' = \Theta(1/k^6)$ and the queries are of tolerance $\tau = n^{-\Theta(k)}$, then learning up to $L^2$ error $\epsilon'$ should require $2^{n^c}$ queries. The loss our reduction achieves is $\epsilon' = \sqrt{2\epsilon}$, so we require $\sqrt{2\epsilon} \le \Theta(1/k^6)$ for the bound to hold. Accordingly, we pick $k = \Theta(\epsilon^{-1/12})$, so that $\tau = n^{-\Theta(k)} = n^{-\Theta(\epsilon^{-1/12})}$. Thus we must have $\frac{1}{\epsilon}\, q(n, \epsilon/4, \tau) \ge 2^{n^c}$. Rearranging and rescaling $\epsilon$ gives the result.

¹We use $\pm$ReLU for simplicity. Any learner can handle this by doing a bit flip on its own.

Theorem 5.2. Let $\mathcal{H}_\sigma = \{x \mapsto \sigma(w \cdot x) \mid \|w\|_2 = 1\}$, where $\sigma$ is the standard sigmoid, be the class of sigmoid units on $\mathbb{R}^n$ with unit weight vectors. Suppose that Assumption 3.1 holds for $\mathcal{H}_\sigma$. Then for any $\epsilon$, there exists $\tau = n^{-\Theta((\log 1/\epsilon)^2)}$ such that $q(n, \epsilon, \tau) \ge 2^{n^c}\epsilon$ for some $0 < c < 1/2$.

Proof. Very similar to the above. We instantiate $\mathcal{G}$ with $\psi = \tanh$, $\phi = \sigma$, and observe that $\mathcal{G} \subseteq \tanh \circ \mathrm{conv}(\mathcal{H}_\sigma)$ and that $\mathrm{diam}(\mathcal{H}_\sigma) \le 2$. In this case, Corollary 2.4(b) tells us that we require $\sqrt{2\epsilon} \le \exp(-\Theta(\sqrt{k}))$ for the lower bound to hold, so we pick $k = (\log 1/\epsilon)^2$. The result now follows exactly as before.

We also obtain a lower bound on the class of halfspaces. Note that for Boolean functions, functional GD is not essential; existing distribution-specific boosting methods [KK09, Fel10] can also give us similar results here.

Theorem 5.3. Let $\mathcal{H}_{\mathrm{hs}} = \{x \mapsto \mathrm{sign}(w \cdot x) \mid \|w\|_2 = 1\}$ be the class of halfspaces on $\mathbb{R}^n$ with unit weight vectors. Suppose that Assumption 3.1 holds for $\mathcal{H}_{\mathrm{hs}}$. Then for any $\epsilon$, there exists $\tau = n^{-\Theta(1/\epsilon)}$ such that $q(n, \epsilon, \tau) \ge 2^{n^c}\epsilon^3$ for some $0 < c < 1/2$.

Proof. To approximate the sign function using a Lipschitz function, we define $g_{\mathrm{sign}}(x)$ to be $-1$ for $x \le -1/k$, $1$ for $x \ge 1/k$, and linearly interpolate in between; this function is $k$-Lipschitz. We claim that $\mathcal{G}$ instantiated with $\psi = \phi = \mathrm{sign}$ satisfies $\mathcal{G} \subseteq g_{\mathrm{sign}} \circ \mathrm{conv}(\mathcal{H}_{\mathrm{hs}})$, with $\mathrm{diam}(\mathcal{G}) = 2$. This is because, as noted in Theorem 2.3, $\mathcal{G}$ has weights $a_i \in \{\pm 1/k\}$, so the sum of halfspaces inside $\psi$ is always a multiple of $1/k$, and $g_{\mathrm{sign}}$ behaves the same as sign on these values. Theorem 4.1 now lets us learn $\mathcal{G}$ up to agnostic error $\epsilon$ (and hence $L^2$ error $\sqrt{2k\epsilon}$, by Eq. (6)) using $O\big(\frac{k^2}{\epsilon}\, q(n, \epsilon/4, \tau)\big)$ queries of tolerance $\tau$. By Corollary 2.4(c), we only need $\sqrt{2k\epsilon} \le \Theta(1)$ for the lower bound to hold, so we may take $k = \Theta(1/\epsilon)$ to get a lower bound of $2^{n^c}$. Thus $\frac{k^2}{\epsilon}\, q(n, \epsilon/4, \tau) \ge 2^{n^c}$, and rearrangement gives the result.

6 Upper bounds on learning ReLUs and sigmoids

We use a variant of the classic low-degree algorithm ([LMN93]; see also [KKMS08]) to provide simple upper bounds for agnostically learning ReLUs and sigmoids. With respect to $D = N(0, I_n)$, the $\delta$-approximate degree of a function $f: \mathbb{R}^n \to \mathbb{R}$ is the smallest $d$ such that there exists a degree-$d$ polynomial $p$ satisfying $\|f - p\| \le \delta$. In Appendix E, we show that for any class $\mathcal{C}$ of $\delta$-approximate degree $d$, picking $\delta = O(\epsilon)$ and simply estimating the Hermite coefficients of $x \mapsto \mathbb{E}[y \mid x]$ up to degree $d$ yields an agnostic learner up to error $\epsilon$. For simplicity we assume that $y \in [-1, 1]$.

Theorem 6.1. The class $\mathcal{H}_{\mathrm{ReLU}}$ can be agnostically learned up to error $\epsilon$ using $n^{O(\epsilon^{-4/3})}$ queries of tolerance $n^{-\Theta(\epsilon^{-4/3})}\epsilon$. Similarly, $\mathcal{H}_\sigma$ can be learned using $n^{O(\log^2 1/\epsilon)}$ queries of tolerance $n^{-\Theta(\log^2 1/\epsilon)}\epsilon^2$.

Proof. Approximating the Hermite coefficients of degree at most $d$ takes $n^{O(d)}$ queries of tolerance $n^{-\Theta(d)}\epsilon$. As we show in Appendix F, the $\delta$-approximate degree of unit-weight ReLUs is $O((1/\delta)^{4/3})$ and that of unit-weight sigmoids is $O(\log^2 1/\delta)$. The guarantees follow by the argument in Appendix E.
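As an illustration of this estimation step, here is a minimal sample-based sketch of the low-degree approach (helper names are ours; in the SQ implementation each empirical mean below would instead be answered by a STAT query of tolerance roughly $n^{-\Theta(d)}\epsilon$, and the error accounting of Appendix E is omitted):

```python
import itertools, math
import numpy as np
from numpy.polynomial.hermite_e import hermeval

rng = np.random.default_rng(3)
n, d, m = 3, 4, 100_000    # tiny dimension and degree so the sketch stays small

def hermite_orthonormal(k, t):
    """He_k(t) / sqrt(k!), orthonormal under N(0, 1)."""
    c = np.zeros(k + 1); c[k] = 1.0
    return hermeval(t, c) / math.sqrt(math.factorial(k))

def multi_indices(n, d):
    """All multi-indices alpha with |alpha| <= d."""
    return (a for a in itertools.product(range(d + 1), repeat=n) if sum(a) <= d)

def low_degree_fit(x, y, d):
    """Estimate c_alpha = E[y * H_alpha(x)] for all |alpha| <= d and return the
    degree-d polynomial p(x) = sum_alpha c_alpha H_alpha(x)."""
    coefs = {}
    for alpha in multi_indices(x.shape[1], d):
        H = np.prod([hermite_orthonormal(k, x[:, i]) for i, k in enumerate(alpha)], axis=0)
        coefs[alpha] = float(np.mean(y * H))
    def p(z):
        out = np.zeros(len(z))
        for alpha, c in coefs.items():
            out += c * np.prod([hermite_orthonormal(k, z[:, i]) for i, k in enumerate(alpha)], axis=0)
        return out
    return p

# Toy demo: noisy labels whose conditional mean is a unit-weight ReLU.
w = rng.standard_normal(n); w /= np.linalg.norm(w)
x = rng.standard_normal((m, n))
y = np.maximum(x @ w, 0.0) + 0.3 * rng.standard_normal(m)
p = low_degree_fit(x, y, d)
print("mean squared error of the degree-d fit:",
      float(np.mean((p(x) - np.maximum(x @ w, 0.0)) ** 2)))
```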
As we show in Appendix F, the δ-approximate degree of unit-weight Re LUs is O((1/δ)4/3) of unit-weight sigmoids is O(log2 1/δ). The guarantees follow by the argument in Appendix E. We note that our lower bounds for Re LUs and sigmoids were for queries of tolerance n Θ(ϵ 1/12) and n Θ(log2 1/ϵ) respectively, which nearly matches these upper bounds. Acknowledgements We thank the anonymous reviewers for their feedback. Broader Impact As deep learning techniques continue to grow rapidly in real-world usage and importance, the study of their theoretical guarantees has become important as well. This paper gives theoretical results concerning some basic primitives of neural networks, and as such contributes to our growing rigorous understanding of deep learning. The lower bounds in this paper address a model of learning in which there is arbitrary noise in the labels, as is common in some settings. They add to the body of work demonstrating that even very simple neural networks can fail dramatically to have good theoretical guarantees, providing some guidance as to conditions under which neural networks might perform particularly poorly even in practice. Another area of practical importance that our results relate to is that of differentially private data analysis/release, where the SQ complexity of agnostic learning is known to characterize the complexity of privately answering a class of queries up to small error [GHRU13, DR14]. SG was supported by the JP Morgan AI Ph D Fellowship. AG was supported by NSF awards AF1909204 and AF-1717896, and a UT Austin Provost s Fellowship. AK was supported by NSF awards AF-1909204, AF-1717896, and the NSF AI Institute for Foundations of Machine Learning (IFML). Work done while visiting the Institute for Advanced Study, Princeton, NJ. [ADHV19] Alexandr Andoni, Rishabh Dudeja, Daniel Hsu, and Kiran Vodrahalli. Attribute-efficient learning of monomials over highly-correlated variables. In Algorithmic Learning Theory, pages 127 161, 2019. 1 [APVZ14] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning sparse polynomial functions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 500 510. SIAM, 2014. 1 [Bac17] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629 681, 2017. 1 [BHKL15] Alina Beygelzimer, Elad Hazan, Satyen Kale, and Haipeng Luo. Online gradient boosting. In Advances in neural information processing systems, pages 2458 2466, 2015. 1 [Boy84] John P Boyd. Asymptotic coefficients of hermite function series. Journal of Computational Physics, 54(3):382 410, 1984. F [DFGK17] Carlos M Da Fonseca, M Lawrence Glasser, and Victor Kowalenko. Basic trigonometric power sums with applications. The Ramanujan Journal, 42(2):401 428, 2017. D [DGK+20] Ilias Diakonikolas, Surbhi Goel, Sushrut Karmalkar, Adam Klivans, and Mahdi Soltanolkotabi. Approximation Schemes for Re LU Regression. In Conference on Learning Theory, 2020. To appear. 1 [DKKZ20] Ilias Diakonikolas, Daniel Kane, Vasilis Kontonis, and Nikos Zarifis. Algorithms and SQ Lower Bounds for PAC Learning One-Hidden-Layer Re LU Networks. In Conference on Learning Theory, 2020. To appear. 1, 1, 1, 2, 2.3, C, D [DKN10] Ilias Diakonikolas, Daniel M Kane, and Jelani Nelson. Bounded independence fools degree-2 threshold functions. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 11 20. IEEE, 2010. 
[DKS17] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73-84. IEEE, 2017.

[DKZ20] Ilias Diakonikolas, Daniel Kane, and Nikos Zarifis. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. In Advances in Neural Information Processing Systems, 2020.

[DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

[FCG20] Spencer Frei, Yuan Cao, and Quanquan Gu. Agnostic learning of a single neuron with gradient descent. arXiv preprint arXiv:2005.14426, 2020.

[Fel10] Vitaly Feldman. Distribution-specific agnostic boosting. In Andrew Chi-Chih Yao, editor, Innovations in Computer Science (ICS 2010), Tsinghua University, Beijing, China, January 5-7, 2010, Proceedings, pages 241-250. Tsinghua University Press, 2010.

[Fel12] Vitaly Feldman. A complete characterization of statistical query learning with applications to evolvability. Journal of Computer and System Sciences, 78(5):1444-1459, 2012.

[Fri01] Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189-1232, 2001.

[GGJ+20] Surbhi Goel, Aravind Gollakota, Zhihan Jin, Sushrut Karmalkar, and Adam Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. In International Conference on Machine Learning, 2020. To appear.

[GHRU13] Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. Privately releasing conjunctions and the statistical query barrier. SIAM Journal on Computing, 42(4):1494-1520, 2013.

[GKK19] Surbhi Goel, Sushrut Karmalkar, and Adam Klivans. Time/accuracy tradeoffs for learning a ReLU with respect to Gaussian marginals. In Advances in Neural Information Processing Systems, pages 8582-8591, 2019.

[GKKT17] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the ReLU in polynomial time. In Conference on Learning Theory, pages 1004-1042, 2017.

[Haz16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

[Hil40] Einar Hille. Contributions to the theory of Hermitian series II: the representation problem. Transactions of the American Mathematical Society, 47(1):80-94, 1940.

[Jac97] Jeffrey C. Jackson. An efficient membership-query algorithm for learning DNF with respect to the uniform distribution. Journal of Computer and System Sciences, 55(3):414-440, 1997.

[Jag13] Martin Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427-435, 2013.

[KK09] Varun Kanade and Adam Kalai. Potential-based agnostic boosting. In Advances in Neural Information Processing Systems, pages 880-888, 2009.

[KK14] Adam Klivans and Pravesh Kothari. Embedding hard learning problems into Gaussian space. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM), 2014.

[KKMS08] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777-1805, 2008.

[KS94] Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464-497, 1994.
[LMN93] Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform, and learnability. Journal of the ACM (JACM), 40(3):607-620, 1993.

[LSSS14] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855-863, 2014.

[MBBF00] Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus R. Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pages 512-518, 2000.

[PSG19] Abhishek Panigrahi, Abhishek Shetty, and Navin Goyal. Effect of activation functions on the training of overparametrized neural nets. arXiv preprint arXiv:1908.05660, 2019.

[SF12] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.

[SVWX17] Le Song, Santosh Vempala, John Wilmes, and Bo Xie. On the complexity of learning neural networks. In Advances in Neural Information Processing Systems, pages 5514-5522, 2017.

[VW19] Santosh Vempala and John Wilmes. Gradient descent for one-hidden-layer neural networks: polynomial convergence and SQ lower bounds. In COLT, volume 99, 2019.

[YS19] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. In Advances in Neural Information Processing Systems, pages 6594-6604, 2019.

[YS20] Gilad Yehudai and Ohad Shamir. Learning a single neuron with gradient methods. arXiv preprint arXiv:2001.05205, 2020.