Journal of Machine Learning Research 23 (2022) 1-56. Submitted 9/21; Revised 3/22; Published 3/22.

Depth separation beyond radial functions

Luca Venturi (venturi@cims.nyu.edu), Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA

Samy Jelassi (sjelassi@princeton.edu), Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08540, USA

Tristan Ozuch (ozuch@mit.edu), Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02142, USA

Joan Bruna (bruna@cims.nyu.edu), Courant Institute of Mathematical Sciences and Center for Data Science, New York University, New York, NY 10011, USA

Editor: Julien Mairal

Abstract

High-dimensional depth separation results for neural networks show that certain functions can be efficiently approximated by two-hidden-layer networks but not by one-hidden-layer ones in high dimensions. Existing results of this type mainly focus on functions with an underlying radial or one-dimensional structure, which are usually not encountered in practice. The first contribution of this paper is to extend such results to a more general class of functions, namely functions with a piece-wise oscillatory structure, by building on the proof strategy of (Eldan and Shamir, 2016). We complement these results by showing that, if the domain radius and the rate of oscillation of the objective function are constant, then approximation by one-hidden-layer networks holds at a $\mathrm{poly}(d)$ rate for any fixed error threshold.

The mentioned results show that one-hidden-layer networks fail to approximate high-energy functions whose Fourier representation is spread in the frequency domain, while they succeed at approximating functions having a sparse Fourier representation. However, the choice of the domain represents a source of gaps between these positive and negative approximation results. We conclude the paper by focusing on a compact approximation domain, namely the sphere $S^{d-1}$ in dimension $d$, where we provide a characterization, in terms of their Fourier expansion, both of the functions which are efficiently approximable by one-hidden-layer networks and of the functions which are provably not.

Keywords: Neural networks, Depth separation

© 2022 Luca Venturi, Samy Jelassi, Tristan Ozuch and Joan Bruna. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-1109.html.

1. Introduction

Learning in high dimensions is a challenging task for computational, statistical and approximation reasons. Even in the classic supervised learning setup, current empirical successes of deep learning algorithms remain largely out of reach for existing theories, despite phenomenal recent progress. Amongst the algorithmic aspects enabling this success, depth remains a major non-negotiable element. Depth in structured neural networks such as convolutional neural networks provides a multiscale processing of information, but more generally it defines an intricate function class with powerful approximation biases. Understanding the benefits of depth for approximating certain functions of interest represents a long-standing problem. The classic universal approximation theorem ensures approximation by neural networks of any continuous function, but it focuses on shallow (that is, one-hidden-layer) models and does not provide any approximation rates.
The seminal work (Barron, 1993) provides quadratic approximation rates by shallow networks under a sparsity condition on the Fourier transform. Recent works (Eldan and Shamir, 2016; Daniely, 2017) suggest that this property (sparsity of the Fourier transform) is essentially necessary in order to recover polynomial approximation rates, by constructing examples of deep networks whose Fourier transform is spread in direction and supported away from zero in the frequency domain, and by showing that these functions cannot be efficiently approximated by a shallow counterpart. These depth-separation phenomena occur in the high-dimensional regime, where approximation by neural networks of standard Sobolev spaces is cursed (see e.g. (Maiorov and Meir, 2000)). On the other hand, proofs of such high-dimensional depth-separation phenomena are currently limited to radial functions, that is, functions of the form $f(x) = \phi(\|Ax + b\|_2)$.

In this work we extend the results just cited. We describe rates of approximation by one-hidden-layer networks in terms of the number of units $N$ of the network, by looking at the Fourier representation of the function to be approximated. We consider two types of approximation rate, inspired by the work (Safran et al., 2019): (i) the rate of approximation is polynomial in both the input dimension $d$ and the error threshold $\epsilon$, that is $N \leq \mathrm{poly}(d, \epsilon^{-1})$; we refer to this rate of approximation as universal approximation; (ii) for any fixed error threshold $\epsilon$, the number of units $N$ needed for approximation depends at most polynomially on $d$, that is $N \leq \mathrm{poly}(d)$; we refer to this rate of approximation as fixed-threshold approximation. We distinguish two fundamentally different regimes of approximation: relative to a heavy-tailed, unbounded data distribution, or relative to a concentrated distribution. Whereas the former captures the most general setup, the latter is motivated by practical machine learning applications.

Our contributions are as follows. First, we consider a class of two-hidden-layer networks exhibiting piece-wise oscillatory behavior, namely functions of the form $f_{r,w,v} : x \in \mathbb{R}^d \mapsto e^{2\pi i r (v^T x + w^T x_+)}$. In section 3, we show that, under appropriately heavy-tailed data distributions, approximation at a rate $N \leq \mathrm{poly}(d)$ cannot hold (unconditionally on the weights of the approximant network), as long as the rate of oscillations $r$ grows faster than $d$. On the other hand, $f_{r,w,v}$ can be universally approximated (that is, at a rate $\mathrm{poly}(d, \epsilon^{-1})$) by a two-hidden-layer network with any practical activation of choice. The proof of this result (Theorem 4) extends the main idea introduced by the results of Eldan and Shamir (Eldan and Shamir, 2016) beyond the radial case. In section 4, we show that the $\mathrm{poly}(d)$-oscillatory aspect and the heavy-tailed data distributions are necessary in the depth-separation result mentioned above. More specifically, we show that any deep network with $O(1)$-bounded weights and $O(1)$-Lipschitz activation can be fixed-threshold approximated by one-hidden-layer networks over a compact set of radius $O(1)$ (Theorem 11). This extends an equivalent result in (Safran et al., 2019) from the class of radial functions to that of deep neural networks with Hölder activations. The aforementioned depth separation results consider functions whose Fourier representation is spread in high frequencies.
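For intuition, here is a minimal numpy sketch of the piecewise-oscillatory family $f_{r,w,v}$ just introduced; the specific choices of $r$, $w$ and $v$ below are illustrative assumptions, picked to match the scalings in Theorem 4.

```python
import numpy as np

def f_rwv(x, r, w, v):
    """f_{r,w,v}(x) = exp(2*pi*i*r*(v^T x + w^T x_+)): constant modulus 1,
    oscillation rate controlled by r, piecewise structure from the ReLU x_+."""
    return np.exp(2j * np.pi * r * (v @ x + w @ np.maximum(x, 0.0)))

rng = np.random.default_rng(0)
d = 10
r = float(d**2)                  # oscillations growing polynomially in d
w = np.ones(d) / np.sqrt(d)      # ||w||_2 = Theta(1)
v = np.zeros(d)                  # ||v||_2 = O(1)
x = rng.standard_normal(d)
print(abs(f_rwv(x, r, w, v)))    # 1.0: unit envelope
```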
On the other hand, universal approximation results often require the function to be approximated to be, in some sense, sparse in the Fourier domain. Unfortunately, there are currently many gaps between these two types of results, one of them being the definition of the approximation domain. In order to reduce the gap between the two results above, we consider approximation on a fixed compact domain, namely the unit sphere $S^{d-1}$, where Fourier analysis can be done using spherical harmonics. We identify two conditions on the spherical harmonics decomposition of a function $f \in C(S^{d-1})$. The first is a sparsity condition on the decomposition, which we show to be sufficient to prove universal approximation (that is, at a rate $N \leq \mathrm{poly}(d, \epsilon^{-1})$) of $f$ by one-hidden-layer networks. The second is a high-energy spreadness condition on the spherical harmonics decomposition of $f$, which we show to imply that universal approximation of $f$ by one-hidden-layer networks cannot hold. This is the content of section 5, whose main results are summarized in section 5.2.

1.1 Related works

There is a vast literature on approximation results for neural networks. Early approximation results provided upper and lower bounds on the approximation of certain functional spaces, such as Sobolev spaces (Maiorov and Meir, 2000) or $L^p$ spaces (Pinkus, 1999), by neural networks. For high input dimensions $d$, such results hold for functions with smoothness proportional to $d$, or require an approximation rate that scales as $N \sim \epsilon^{-d}$ (see e.g. (Petersen, 2020; Gühring et al., 2020) for a review), where $N$ denotes the number of units of the network and $\epsilon$ the error threshold. In more recent years, quite a few works pointed out the benefits of deep networks versus their shallow counterparts from the point of view of approximation rates. For example, this has been shown for sawtooth functions (Telgarsky, 2016), functions with positive curvature (Liang and Srikant, 2016; Yarotsky, 2017; Safran and Shamir, 2017), functions with a compositional structure (Poggio et al., 2017), piecewise smooth functions (Petersen and Voigtlaender, 2018), Gaussian mixture models (Jalali et al., 2019), polynomials (Rolnick and Tegmark, 2017), and model reduction models (Rim et al., 2020). The result of (Telgarsky, 2016) has been further generalized using a notion of periodicity (Chatziafratis et al., 2019). It must be noticed that most of the cited works show depth separation that is independent of the dimension $d$ and that increases exponentially with the depth of the network. Another line of works (Eldan and Shamir, 2016; Daniely, 2017; Safran et al., 2019), on the other hand, shows depth separation exponential in the dimension $d$ between networks with one and two hidden layers. This is the framework of this work. It was also shown recently that depth separation results between fixed depths larger than two are arguably difficult to prove (Vardi and Shamir, 2020; Vardi et al., 2021). This depth-width trade-off has been analyzed through lenses other than approximation capabilities, such as classification capabilities (Malach and Shalev-Shwartz, 2019), exact representability (Arora et al., 2016), Betti numbers (Bianchini and Scarselli, 2014), number of linear regions (Pascanu et al., 2013; Montufar et al., 2014; Raghu et al., 2017; Hanin and Rolnick, 2019a,b), trajectory lengths (Raghu et al., 2017), global curvature (Poole et al., 2016) and topological entropy (Bu et al., 2020).
In essence, all these results state that network expressivity improves exponentially as the depth increases. Another related question is whether depth separation holds from a learnability (therefore, not solely approximation) point of view as well (Malach and Shalev-Shwartz, 2019; Malach et al., 2021). In this work we focus on approximation, and we consider the Fourier representation as a complexity measure. This is the approach followed by e.g. (Eldan and Shamir, 2016; Daniely, 2017), which construct examples of deep neural networks whose Fourier energy is exponentially higher than that of shallow neural networks with a moderate number of units. On the other hand, sparsity of the Fourier transform has been used to show polynomial rates of approximation of functions by neural networks (Klusowski and Barron, 2018; Ongie et al., 2019; Bresler and Nagaraj, 2020). In the last part of the paper, we show that an equivalent condition can be described in terms of the spherical harmonics decomposition.

2. Preliminaries

2.1 Neural networks

For $L \geq 1$, we call an $L$-hidden-layer feed-forward neural network a function

$f : x \in \mathbb{R}^d \mapsto x^{(L+1)}(x) \in \mathbb{C}^{d_{L+1}}$, (1)

where $x^{(L+1)}$ is defined by recursion: $x^{(0)}(x) = x$, $x^{(k)}(x) = \sigma^{(k)}(A^{(k)} x^{(k-1)}(x))$ for $k \in [L]$, and $x^{(L+1)}(x) = A^{(L+1)} x^{(L)}(x)$. Here $A^{(k)} = [a^{(k)}_1 | \cdots | a^{(k)}_{d_k}]^T \in \mathbb{R}^{d_k \times d_{k-1}}$ for $k \in [L]$, $A^{(L+1)} = [a^{(L+1)}_1 | \cdots | a^{(L+1)}_{d_{L+1}}]^T \in \mathbb{C}^{d_{L+1} \times d_L}$ (with $d_0 = d$), and $\sigma^{(k)} : \mathbb{R}^{d_k} \to \mathbb{R}^{d_k}$ are activation functions, that is, $\sigma^{(k)}(x)_i = \sigma^{(k)}_i(x_i)$ for some functions $\sigma^{(k)}_i : \mathbb{R} \to \mathbb{R}$. A neural network is therefore a sequence of sums and compositions of ridge functions, that is, functions of the form $x \mapsto \sigma(w^T x)$. In the following, unless otherwise specified, we only consider neural networks (or, more simply, networks) as defined in (1). Most of the time we will deal with real-valued networks, that is, $A^{(L+1)} \in \mathbb{R}^{d_{L+1} \times d_L}$. We say that a network has activation $\sigma$ if $\sigma^{(k)}_i(x) = \sigma(x + b^k_i)$ for some bias term $b^k_i \in \mathbb{R}$, for all $k, i$. We refer to the function $x \in \mathbb{R}^{d_{k-1}} \mapsto \sigma^{(k)}(A^{(k)} x) \in \mathbb{R}^{d_k}$ as the $k$-th hidden (or inner) layer, of width $d_k$, for $k \in [L]$, while we refer to the linear function defined by $A^{(L+1)}$ as the last (or $(L+1)$-th) layer. We refer to the value $W(f) := \max_{k \in [L]} d_k$ as the width of the network $f$, and to the vectors $a^{(k)}_i$ as the weights (of the $k$-th layer), for all $k, i$. A basic complexity measure for the neural network (1) is given by the total number of units, or size, $N(f) := \sum_{k \in [L]} d_k$. The number of layers $L(f) = L$ is also a relevant measure of complexity, which we refer to as depth. Finally, in the following we sometimes require a control on the value of the weights; such controls are expressed in terms of the $p$-norm of the weights, that is,

$m_p(f) := \max_{k,i} \|a^{(k)}_i\|_p$, for some $p \in [1, \infty]$.

2.2 Neural network approximation rates

We measure the approximation error between two functions $f, g : \Omega \subseteq \mathbb{R}^d \to \mathbb{C}$ in terms of the $L^2(\mu)$ norm (with respect to a probability measure or density $\mu$),

$\|f - g\|^2_{\mu,2} := \int_\Omega |f(x) - g(x)|^2\, d\mu(x)$,

or the $L^\infty$ norm,

$\|f - g\|_{\Omega,\infty} := \sup_{x \in \Omega} |f(x) - g(x)|$.

Notice that a (uniform) $L^2$ lower bound implies an $L^\infty$ one, and vice versa for an upper bound. The focus of this work is to establish upper and lower bounds for the approximation of certain function classes by shallow neural networks, in high dimension $d$. We distinguish two different approximation regimes of interest.
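Before formalizing the two regimes, here is a minimal numpy sketch of the recursion in definition (1) together with the complexity measures just defined; the layer shapes and the ReLU activation are illustrative assumptions.

```python
import numpy as np

def forward(x, As, sigmas):
    """Recursion of definition (1): x^(0) = x, x^(k) = sigma^(k)(A^(k) x^(k-1)),
    and a final linear layer x^(L+1) = A^(L+1) x^(L)."""
    h = x
    for A, sigma in zip(As[:-1], sigmas):
        h = sigma(A @ h)
    return As[-1] @ h

relu = lambda t: np.maximum(t, 0.0)
rng = np.random.default_rng(0)
d, d1, d2 = 8, 16, 4                        # d_0 = d and hidden widths d_1, d_2
As = [rng.standard_normal((d1, d)),         # A^(1)
      rng.standard_normal((d2, d1)),        # A^(2)
      rng.standard_normal((1, d2))]         # A^(3): last (linear) layer
y = forward(rng.standard_normal(d), As, [relu, relu])

width = max(d1, d2)                         # W(f) = max_k d_k
size = d1 + d2                              # total number of hidden units N(f)
m_inf = max(np.abs(A).max() for A in As)    # m_p(f) for p = infinity:
                                            # max over rows a_i^(k) of ||a_i^(k)||_inf
print(y, width, size, m_inf)
```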
In the following we will denote by $\mathcal{F}^\sigma_N$ the space of one-hidden-layer neural networks $f_N : \mathbb{R}^d \to \mathbb{R}$ with width at most $N$ and activation $\sigma$ (where the dimension $d$ is inferred from the context); similarly, we will denote by $\mathcal{F}_N$ the space of one-hidden-layer neural networks with width at most $N$ and any (continuous) activation.

Definition 1 We say that a sequence $\{f^{(d)} : \Omega_d \subseteq \mathbb{R}^d \to \mathbb{C}\}_{d \geq 2}$ is universally approximable by one-hidden-layer networks with activation $\sigma$ if it is approximable at a $\mathrm{poly}(d, \epsilon^{-1})$ rate; that is, if there exist constants $\alpha > 0$ and $\beta > 0$ such that, for every $d$ and every $\epsilon \in (0,1)$, it holds that $\|f^{(d)} - f_N\|_{\Omega_d,\infty} \leq \epsilon$ for some one-hidden-layer $f_N \in \mathcal{F}^\sigma_N$ satisfying $N + m_\infty(f_N) \leq \alpha (d \epsilon^{-1})^\beta$.

Definition 2 We say that $\{f^{(d)}\}_d$ is fixed-threshold approximable if for any $\epsilon \in (0,1)$ it is $\epsilon$-approximable at a $\mathrm{poly}(d)$ rate; that is, if for any $\epsilon > 0$ there exist constants $\alpha > 0$ and $\beta > 0$ such that, for every $d$, it holds that $\|f^{(d)} - f_N\|_{\Omega_d,\infty} \leq \epsilon$ for some one-hidden-layer $f_N \in \mathcal{F}^\sigma_N$ satisfying $N + m_\infty(f_N) \leq \alpha d^\beta$.

These approximation schemes were introduced in (Safran et al., 2019). To ensure significance of the approximation rates, in the following upper and lower bounds are stated for objective functions $f^{(d)}$ normalized such that $\|f^{(d)}\|_2 = 1$ or $\|f^{(d)}\|_\infty = 1$. Notice that some of the inapproximability results shown below are for target networks with complex values. However, one can obtain equivalent results with a real-valued target network by simply taking the real (or imaginary) part of such complex-valued target networks.

2.3 Activation assumptions

Finally, the results in the next sections generally hold for activations satisfying the following assumptions, which are satisfied by common activations such as the ReLU, $\mathrm{ReLU}(x) = x_+$, or the sigmoid, $\mathrm{sigmoid}(x) = (1 + e^{-x})^{-1}$ (Eldan and Shamir, 2016). Most of the results can be easily generalized to hold under less strict conditions, but we adopt these assumptions for the sake of simplicity.

Assumption 1 Given an activation $\sigma : \mathbb{R} \to \mathbb{R}$, there exist constants $\iota_\sigma$ and $\nu_\sigma$ such that:

1. it is $\iota_\sigma$-Lipschitz and $|\sigma(0)| \leq \iota_\sigma$;
2. for any $L$-Lipschitz function $f : \mathbb{R} \to \mathbb{R}$ constant outside of an interval $[-R, R]$ and any $\epsilon > 0$ there exists $f_N \in \mathcal{F}^\sigma_N$ with $\|f - f_N\|_\infty \leq \epsilon$ such that $N + w_\infty(f_N) \leq \nu_\sigma L R \epsilon^{-1}$.

Notice that this assumption implies that, given a (deep) neural network $f$ with $\mathrm{poly}(d)$ weights and activations satisfying Assumption 1, we are always able to replace the activations in $f$ by any other activation satisfying Assumption 1, paying an at most polynomial cost. This is formalized in the following lemma.

Lemma 3 Let $f^{(d)} : K_d \subseteq \mathbb{R}^d \to \mathbb{C}$, $d \geq 2$, be neural networks with activations satisfying Assumption 1 and such that $N(f^{(d)}) + w_\infty(f^{(d)}) + \mathrm{diam}(K_d) \leq \mathrm{poly}(d)$; also let $\sigma$ be any activation function satisfying Assumption 1. Then the sequence $\{f^{(d)}\}_d$ is universally approximable by networks (of the same depth as $f^{(d)}$) with activation $\sigma$.

2.4 Notation

We introduce notation used throughout the rest of the paper. We denote scalar-valued variables as lowercase non-bold; vector-valued variables as lowercase bold; matrix- and tensor-valued variables and multivariate random variables (r.v.'s) as uppercase bold. Given a vector $v \in \mathbb{R}^d$, we denote its components as $v_k$; given a matrix $W \in \mathbb{R}^{n \times m}$, we denote its columns as $w_k$. For a matrix $W$, we denote by $\|W\|_{F,p}$ its entrywise $p$-norm, by $\|W\|_{p,q}$ its $(p,q)$ operator norm (that is, $\|W\|_{p,q} = \max_{\|y\|_p = 1} \|Wy\|_q$) and by $\|W\|_p$ its $p$ operator norm (that is, $\|W\|_p = \|W\|_{p,p}$). We denote by $S^{d-1} \subseteq \mathbb{R}^d$ the $(d-1)$-dimensional sphere $\{x \in \mathbb{R}^d : \|x\|_2 = 1\}$ and by $B^d_{r,p}$ the $\ell^p$ ball of radius $r$ in $\mathbb{R}^d$, that is, $\{x \in \mathbb{R}^d : \|x\|_p \leq r\}$.
We denote by $L^p(\Omega)$, $L^p(\mu)$, $L^p(\phi)$ the spaces of functions $f : \Omega \to \mathbb{R}$ which are $p$-integrable with respect to the Lebesgue measure, the measure $\mu$ or the density $\phi$, respectively. The respective norms (and scalar products for $p = 2$) are denoted by $\|f\|_{\zeta,p}$ ($\langle f, g \rangle_\zeta$) for $\zeta \in \{\Omega, \mu, \phi\}$; we simply write $\|f\|_p$ when the measure is clear from the context. For a finite signed Borel measure $\mu$, we denote its total variation as $\|\mu\|_1$. Finally, we denote by $\hat f$ or $\mathcal{F}(f)$ (resp. $\check f$ or $\mathcal{F}^{-1}(f)$) the Fourier transform (resp. the inverse Fourier transform) of $f$ (meant in the following in the sense of tempered distributions).

3. A depth separation example

Our starting point for the study of depth separation is to consider a generic data distribution $\mu$ with adversarial properties against shallow approximations. In the seminal work (Eldan and Shamir, 2016), Eldan and Shamir establish an unconditional (with no restrictions on the norms of the weights of the network) depth-separation result by considering a density $\mu$ in $\mathbb{R}^d$ with tails $\mu(x) \sim \|x\|_2^{-(d+1)/2}$ and a radial function $f^{(d)}(x) = h_d(\|x\|_2)$, with $h_d : \mathbb{R} \to \mathbb{R}$ a carefully chosen oscillating function with compact support. The proof in (Eldan and Shamir, 2016) reveals the limitations of shallow neural networks at approximating high-dimensional functions via a powerful harmonic analysis insight that is particularly convenient in the setting of radial functions. In this section, we show that their proof strategy can be extended to include more diverse function classes, namely those arising naturally from ReLU networks. Specifically, we consider networks of the form

$f_{r,w,v} : x \in \mathbb{R}^d \mapsto \sigma_r\big(v^T x + w^T x_+\big)$, (2)

where $x_+$ denotes the element-wise ReLU activation, $v, w \in \mathbb{R}^d$ and $\sigma_r(t) = e^{2\pi i r t}$. We are thus considering a function which is piece-wise oscillatory, with constant envelope $|f_{r,w,v}(x)| = 1$, and where the frequency of oscillations is controlled by $r$. The main result of this section can be summarized as follows.

Theorem 4 (Informal) Assume that $\|w\|_2 = \Theta(1)$, $\|v\|_2 = O(1)$ and that $r = \Theta(d^k)$ for some $k \geq 2$. Then there exists a (low-decay) product measure $\mu$ on $\mathbb{R}^d$ such that the function $f_{r,w,v}$ is universally approximable by two-hidden-layer networks but is not fixed-threshold approximable by one-hidden-layer networks.

3.1 The lower bound

Let $\psi \in L^2(\mathbb{R}) \cap L^1(\mathbb{R})$ with $\|\psi\|_2 = 1$, and such that its Fourier transform $\hat\psi$ is compactly supported in $[-K, K]$, for some $K > 0$. Assume also that

$2K \|\psi\|_1^2 < 4$. (3)

This condition ensures that the density $\psi^2$ is sufficiently spread away from zero (see Remark 8). Our first objective is to establish depth separation for the approximation of $f_{r,w,v}$ under the $L^2$ metric defined by the probability density $\phi^2$, where $\phi : x \in \mathbb{R}^d \mapsto \prod_{j=1}^d \psi(x_j)$.

Theorem 5 Let $f^{(d)} = f_{r_d, w_d, v_d}$, for some $r_d \in \mathbb{R}$, $w_d, v_d \in \mathbb{R}^d$. For a fixed $\gamma > 0$, define

$\tau_d := \sup_{S \subseteq [d]} \|v_d + w_{d,S}\|_2$, $\quad \Omega_d := \{ j \in [d] : r_d |w_{d,j}| \geq \gamma d^2 \}$ $\quad$ and $\quad \eta_d := |\Omega_d| / d$,

where $w_{d,S} \in \mathbb{R}^d$ is defined by $w_{d,S,i} = w_i 1\{i \in S\}$. Assume that:

(i) oscillations grow polynomially, that is, $\tau_d r_d = \Theta(d^k)$ for some constant $k > 0$;

(ii) the vectors $w_d$ are sufficiently spread, that is, $\eta_d \geq \eta$ for some $\eta > 0$ independent of $d$;

(iii) the density $\phi^2$ is sufficiently spread, i.e. $2K \|\psi\|_1^2 < 2^{2\eta}$.

Then there exists a constant $\alpha \in (0,1)$ (independent of $d$) such that for all $d$ it holds that

$\inf_{f_N \in \mathcal{F}_N} \|f^{(d)} - f_N\|^2_{\phi^2,2} \geq 1 - N\, \alpha^d\, O(d^{k+1})$. (4)

Notice that this lower bound is unconditional on the weights of the neurons $m_\infty(f_N)$. The proof follows a similar strategy as in the work (Eldan and Shamir, 2016).
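To make the quantities $\tau_d$, $\Omega_d$ and $\eta_d$ of Theorem 5 concrete before sketching the proof, here is a small brute-force sketch; the norm in $\tau_d$ is taken to be $\ell^2$, and the example weights are illustrative assumptions chosen so that conditions (i) and (ii) hold.

```python
import numpy as np
from itertools import chain, combinations

def theorem5_quantities(r, w, v, gamma=1.0):
    """tau_d = sup_S ||v + w_S||_2, Omega_d = {j : r|w_j| >= gamma d^2},
    eta_d = |Omega_d| / d.  Brute force over 2^d subsets: small d only."""
    d = len(w)
    subsets = chain.from_iterable(combinations(range(d), n) for n in range(d + 1))
    tau = max(np.linalg.norm(v + np.where(np.isin(np.arange(d), S), w, 0.0))
              for S in map(list, subsets))
    Omega = [j for j in range(d) if r * abs(w[j]) >= gamma * d**2]
    return tau, Omega, len(Omega) / d

d = 8
w = np.ones(d) / np.sqrt(d)      # ||w||_2 = 1
v = np.zeros(d)
r = d**3                         # r * |w_j| = d^2.5 >= d^2: every coordinate counts
tau, Omega, eta = theorem5_quantities(r, w, v)
print(tau, eta)                  # tau = 1, eta = 1, so tau_d * r_d = Theta(d^3)
```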
The approximation error can be expressed in the Fourier domain as

$\|f_{r_d,w_d,v_d} - f_N\|^2_{\phi^2,2} = \|f_{r_d,w_d,v_d}\, \phi - f_N\, \phi\|^2_2 = \|\hat f_{r_d,w_d,v_d} * \hat\phi - \hat f_N * \hat\phi\|^2_2$.

Thanks to the assumptions, the target function $f_{r_d,w_d,v_d}$ satisfies a key property, namely that its Fourier transform has its energy sufficiently spread in the high frequencies after the convolution with $\hat\phi$. Such frequency spread is caused by the shattering of the first ReLU layer, which effectively creates $\Theta(2^{\eta d})$ different frequencies. The piece-wise structure arising from the ReLU can be handled in the Fourier domain by the Hilbert transform of the function $\psi$, which has sufficient decay thanks to the assumptions. Noticing that $\|\hat f_{r_d,w_d,v_d} * \hat\phi\|_2 = 1$, this is formalized in the following.

Lemma 6 (Informal) It holds that $|(\hat f_{r_d,w_d,v_d} * \hat\phi)(\xi)| \lesssim 2^{-\eta d}\, \|\phi\|_1\, \|\xi\|^{-1}$ for $\|\xi\| \leq \mathrm{poly}(d)$.

On the other hand, since $\hat\phi$ is compactly supported and the Fourier transform of a single-unit network is localised on a frequency ray, the Fourier transform of $f_{r_d,w_d,v_d}\, \phi$ is localised on a union of $N$ tubes of the form $T_\alpha = \mathrm{span}(\{\alpha\}) + [-K, K]^d$. This implies that

$\inf_{f_N \in \mathcal{F}_N} \|f_{r_d,w_d,v_d} - f_N\|^2_{\phi^2,2} \geq \inf_{f_N \in T(N)} \|f_{r_d,w_d,v_d} - f_N\|^2_{\phi^2,2}$,

where $T(N)$ denotes the set of $L^2$ functions whose Fourier transform is supported on the union of $N$ tubes $T_{\alpha_1}, \dots, T_{\alpha_N}$ as above, for some arbitrary $\alpha_1, \dots, \alpha_N \in \mathbb{R}^d$. Thanks to Plancherel's identity, and since $\|f_{r_d,w_d,v_d}\|_{\phi^2,2} = 1$, it further holds that

$\inf_{f_N \in T(N)} \|f_{r_d,w_d,v_d} - f_N\|^2_{\phi^2,2} \geq 1 - N \sup_{\alpha \in S^{d-1}} \big\| 1_{T_\alpha} \cdot \big( \hat f_{r_d,w_d,v_d} * \hat\phi \big) \big\|^2_2$,

where $1_{T_\alpha}$ denotes the indicator function of $T_\alpha$. Lemma 6 can then be used to show that such projections are exponentially (in $d$) small, which implies the bound (4). The detailed proof is deferred to section A.1.

Remark 7 Theorem 5 asks for two main conditions to hold. First, the magnitude of oscillations of the objective function (parametrised by $r_d$) must grow at least polynomially with $d$, similarly to the assumptions in the works (Eldan and Shamir, 2016) and (Daniely, 2017). Second, the data distribution $\mu$ with density $\phi^2$ should be heavy-tailed, in order for its Fourier transform to be sufficiently localised. When $r_d$ does not grow fast enough with $d$, the energy starts piling up at the low frequencies, creating an important roadblock to establishing approximation lower bounds, and leaving open the possibility of efficient shallow approximation. Similarly, when $\mu$ concentrates too quickly, the proof strategy also fails, due to the fact that in that case $\hat\phi$ is too spread in the Fourier domain, creating full overlap of the energies.

Remark 8 The admissibility condition (3) is necessary since $\eta \leq 1$ by definition. Notice that

$1 = \|\psi\|_2^2 = \|\hat\psi\|_2^2 \leq 2K \|\hat\psi\|_\infty^2 \leq 2K \|\psi\|_1^2$,

and therefore condition (3) can be considered as a requirement that $\psi$ not be too concentrated at the origin. The choice $\psi(t) = \sqrt{3/2}\, \mathrm{sinc}^2(\pi t)$ corresponds to $K = 1$, $\|\psi\|_1 = \sqrt{3/2}$ and $\|\psi\|_2 = 1$, which verifies (3). In that case, from condition (ii) we need $\eta > \frac{\log_2 3}{2} \approx 0.79$. However, the choice $\psi(t) = C\, \mathrm{sinc}(\pi t)$ (the separable analogue of the density considered in (Eldan and Shamir, 2016)) is not admissible, since $\psi$ is not in $L^1$. The lower bound may be optimized by finding compactly supported windows with an optimal $L^1$-to-$L^2$ ratio of their Fourier transforms.

Remark 9 The theorem considers a separable ReLU transform $x \mapsto x_+$, combined with a separable data distribution $\mu$ with density $\phi^2$. One could expect a similar lower bound to apply in the more general case of a layer of the form $x \mapsto (Ux + b)_+$, $U \in \mathbb{R}^{d \times d}$, $b \in \mathbb{R}^d$.
Such a general case replaces the Hilbert transform of $\psi$ with the Fourier transform of indicators of convex polytopes, which has been used in the context of ReLU networks to characterize spectral properties (Rahaman et al., 2019).

Example 1 We give an explicit example of a family of functions $f^{(d)} : \mathbb{R}^d \to \mathbb{C}$ which satisfy the assumptions of Theorem 5. Consider the functions

$f^{(d)}(x) = \exp\Big( 2\pi i\, d^2 \sum_{k=1}^d \max\{0, x_k\} \Big)$,

that is, $f_{r,w,v}$ with $r_d = d^2$, $w_d = (1, \dots, 1)$ and $v_d = 0$. Then, if $\mu_d$ is the product probability measure defined by the density in Remark 8, that is,

$d\mu_d(x) = \prod_{k=1}^d \frac{3}{2}\, \mathrm{sinc}^4(\pi x_k)\, dx_k$,

it holds that

$\|f_N - f^{(d)}\|^2_{\mu_d,2} \geq 1 - 1300\, N\, d^2\, (0.75)^d$.

For example, this implies that $\|f_N - f^{(d)}\|_{\mu_d,2} \geq 1/2$ for any $N \leq \mathrm{poly}(d)$, for $d$ large enough. The numbers are obtained by explicitly tracking the constants in the proof of Theorem 5 (see section A.1 for more details). Finally, notice that the functions $f^{(d)}$ are not radial. Indeed, they show different behaviour over each orthant, thanks to the ReLU layer, whereas radial functions would behave equally over every orthant. A similar reasoning holds more generally for radiality in certain directions of the input.

3.2 The upper bound

According to the definition of neural networks we gave in section 2.1, the function $f_{r,w,v}$ is naturally a two-hidden-layer neural network. However, while there are cases of sinusoidal activations being used in practice, activations such as the ReLU or the sigmoid are more relevant to practical applications. The following theorem, proved in section A.2, shows that we can efficiently represent the function $f_{r,w,v}$, under the hypotheses of Theorem 5, as a two-hidden-layer neural network with a fixed activation, such as the ReLU or the sigmoid. The main technical difference with Lemma 3 is that the result is proved for approximation w.r.t. the probability measure with density $\phi^2$ introduced above.

Theorem 10 Let $\sigma$ be an activation satisfying Assumption 1. Assume that there exists a constant $k \geq 1$ such that $m_\infty(f_{r_d,v_d,w_d}) \leq O(d^k)$, and assume that $\psi$ is such that $|\psi(x)| = O(|x|^{-1})$. Then, for every $\epsilon > 0$, there exists $f_N \in \mathcal{F}^\sigma_N$ with $N + m_\infty(f_N) \leq O\big( d^{2(1+k)}\, \epsilon^{-3/2} \big)$ such that $\|f_N - f_{r_d,w_d,v_d}\|^2_{\phi^2,2} \leq \epsilon$.

Theorems 5 and 10 therefore establish a depth separation result. If $f^{(d)} = f_{r_d,w_d,v_d}$ are defined with $r_d, w_d, v_d$ satisfying the assumptions of both theorems (that is, they satisfy assumptions (i)-(ii)-(iii) of Theorem 5 with $\tau_d r_d = \Theta(d^k)$), then Theorem 5 says that $\{f^{(d)}\}_d$ is not fixed-threshold approximable by one-hidden-layer networks, while Theorem 10 says that the sequence is universally approximable by two-hidden-layer networks with a fixed activation satisfying Assumption 1. For example, the family of functions considered in Example 1 satisfies such assumptions. We thus identify two key aspects responsible for such depth separation: heavy-tailed data and oscillations growing with dimension. In the next sections we want to understand how necessary these two conditions are. The next section shows that if these two conditions no longer hold, then a lower bound such as the one in Theorem 5 is not achievable; more specifically, we show that the objective function is fixed-threshold approximable by one-hidden-layer networks.

4. Approximation of deep networks by shallow ones

In this section, we show that any deep neural network $f$ (which includes the target functions considered in the previous section) can be approximated by shallow ones at a rate which is polynomial in $d$, as long as the rate of oscillation in the inner layers of $f$ is constant in $d$ and the metric is concentrated in a ball of constant radius.
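As a one-dimensional warm-up for the shallow approximation results of this section, here is a sketch of the building block behind Assumption 1.2: an $L$-Lipschitz function that is constant outside $[-R, R]$ is matched up to $\epsilon$ by a one-hidden-layer ReLU network with $O(L R \epsilon^{-1})$ units, via piecewise-linear interpolation (constants are not optimized).

```python
import numpy as np

def relu_interpolant(f, R, L, eps):
    """One-hidden-layer ReLU network realizing Assumption 1.2 (a sketch):
    piecewise-linear interpolation of an L-Lipschitz f that is constant
    outside [-R, R], using O(L * R / eps) units."""
    n = max(2, int(np.ceil(2 * L * R / eps)))   # number of interpolation cells
    knots = np.linspace(-R, R, n + 1)
    vals = f(knots)
    slopes = np.diff(vals) / np.diff(knots)
    # ReLU coefficients are the slope changes; the appended 0 cancels the
    # final slope, so the network is also constant to the right of R
    coefs = np.diff(slopes, prepend=0.0, append=0.0)
    def g(t):
        t = np.asarray(t, dtype=float)
        return vals[0] + coefs @ np.maximum(t - knots[:, None], 0.0)
    return g, n + 1                              # network and its width

f = lambda t: np.minimum(1.0, np.abs(t))         # 1-Lipschitz, constant outside [-1, 1]
g, N = relu_interpolant(f, R=1.0, L=1.0, eps=0.05)
t = np.linspace(-2.0, 2.0, 2001)
print(N, np.max(np.abs(f(t) - g(t))))            # width and sup error (below eps)
```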
We start by stating the result in a general form for two-hidden-layer networks, and we discuss some consequences and extensions afterwards. Consider a family of two-hidden-layer neural networks $\{f^{(d)} : K_d \subseteq \mathbb{R}^d \to \mathbb{C}\}$ of the form

$f^{(d)} : x \in \mathbb{R}^d \mapsto \gamma_d^T\, g\big( W_d^T\, h(U_d^T x) \big) \in \mathbb{C}$, (5)

where $h = h^{(d)} : \mathbb{R}^{p_d} \to \mathbb{R}^{p_d}$ and $g = g^{(d)} : \mathbb{R}^{o_d} \to \mathbb{R}^{o_d}$ are, respectively, component-wise 1-Lipschitz and $(1,\alpha)$-Hölder¹ activation functions, and $U_d \in \mathbb{R}^{d \times p_d}$, $W_d \in \mathbb{R}^{p_d \times o_d}$, $\gamma_d \in \mathbb{C}^{o_d}$. We wish to approximate $f^{(d)}$ by one-hidden-layer neural networks with a given activation.

1. We say that a function $g : \mathbb{R} \to \mathbb{R}$ is $(1,\alpha)$-Hölder if it holds that $|g(x) - g(y)| \leq |x - y|^\alpha$ for all $x, y \in \mathbb{R}$.

Theorem 11 Assume that $\mathrm{diam}(K_d) = O(1)$ and that the networks $f^{(d)}$ have $\ell^1$-bounded weights, that is, $m_1(f^{(d)}) = O(1)$. Then, for every activation $\sigma$ satisfying Assumption 1.2 and every $\epsilon \in (0,1)$, it holds that $\inf_{f^\sigma_N \in \mathcal{F}^\sigma_N} \|f^{(d)} - f^\sigma_N\|_{K_d,\infty} \leq \epsilon$ for some $N \leq \exp\big( O\big( \epsilon^{-1-2/\alpha} \log(p_d/\epsilon) \big) \big)$. Moreover, $f^\sigma_N$ can be chosen such that $m_\infty(f^\sigma_N)\,(1 + N^2) \leq \exp\big( O\big( \epsilon^{-1-2/\alpha} \log(p_d/\epsilon) \big) \big)$.

The proof is constructive and based on the following observation. Consider the case where $o_d = 1$, $\gamma_d = 1$, $p_d = p$ and $g(x) = x^r$ for some positive integer $r$. If $h_k(x) = e^{ix}$ for all $k \in [p]$, then the function $f = f^{(d)}$ in (5) has the form

$f(x) = \Big( \sum_{k=1}^N w_k\, e^{i u_k^T x} \Big)^r$

for some $w \in \mathbb{R}^N$, $u_k \in \mathbb{R}^d$, where $N = p$. By expanding the power we can write

$f(x) = \sum_{j_1 + \cdots + j_N = r} \binom{r}{j_1, \dots, j_N}\, w_1^{j_1} \cdots w_N^{j_N}\, e^{i \left( \sum_{h=1}^N j_h u_h \right)^T x}$,

that is, a formulation of $f$ as a one-hidden-layer network with activation $\sigma_1(t) = e^{2\pi i t}$ (in the following we refer to this type of networks as shallow Fourier networks) and a number of units that scales as $N^r$. Since both polynomials and trigonometric polynomials are universal approximators, with well-known convergence rates, in the general case one can proceed as follows. Each of the non-linearities applied to the first hidden layer can be approximated by a trigonometric polynomial at a polynomial rate on the interval of interest. Similarly, every non-linearity applied to the second hidden layer can be approximated by a polynomial at a linear (in the degree of the polynomial) rate on the interval of interest. Assuming for simplicity that both rates behave as $\epsilon^{-1}$, where $\epsilon > 0$ denotes the approximation error, the composition of the two approximations following the structure of the target network results in a shallow Fourier network (that is, with activation $\sigma_1(t) = e^{2\pi i t}$) whose size $N$ behaves, roughly speaking, as

$N = \Theta\big( (p\, \epsilon^{-2})^{\epsilon^{-1}} \big)$.

Moreover, it is also possible to control the value of the coefficients appearing in the final approximation. With this, we can approximate each summand in the shallow Fourier network by a one-hidden-layer network with activation $\sigma$ with a controlled number of units, thanks to Assumption 1.2. A more detailed statement and a formal proof are reported in appendix B. In essence, in Theorem 11 we show that it is possible to fixed-threshold approximate a two-hidden-layer neural network with constant-in-$d$ oscillations at a $\mathrm{poly}(d)$ rate over a compact set of constant-in-$d$ radius. On the other hand, it is easy to show that it is also possible to obtain approximation at a $\mathrm{poly}(\epsilon^{-1})$ rate (see section B.7) for fixed $d$. Finally, existing results in the literature (see (Safran et al., 2019)) show that universal approximation is not possible, the counterexample being essentially a radial function. Interestingly, the upper bound in Theorem 11 does not depend on the number of units in the second layer of the objective function.
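To make the expansion step in the proof sketch concrete, here is a minimal sketch for the special case $g(x) = x^r$ with complex-exponential first-layer units: it flattens the two-hidden-layer function into a shallow Fourier network by enumerating the $p^r$ index tuples, and checks the identity pointwise.

```python
import numpy as np
from itertools import product

def flatten_power(w, U, r):
    """Expand f(x) = (sum_k w_k exp(i u_k^T x))^r into a shallow Fourier
    network, returned as a dict {frequency: coefficient}, one unit per
    distinct frequency.  The unit count scales as p^r with p = len(w)."""
    p = len(w)
    units = {}
    for j in product(range(p), repeat=r):          # all p^r tuples (j_1,...,j_r)
        coef = np.prod([w[h] for h in j])
        freq = tuple(U[list(j)].sum(axis=0))       # u_{j_1} + ... + u_{j_r}
        units[freq] = units.get(freq, 0.0) + coef
    return units

# sanity check: the flattened one-hidden-layer Fourier network agrees pointwise
rng = np.random.default_rng(0)
p, d, r = 3, 5, 4
w, U = rng.standard_normal(p), rng.standard_normal((p, d))
x = rng.standard_normal(d)
lhs = (w @ np.exp(1j * U @ x)) ** r
rhs = sum(c * np.exp(1j * np.asarray(f) @ x) for f, c in flatten_power(w, U, r).items())
print(np.isclose(lhs, rhs))                        # True
```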
The second-layer width is instead hidden in the control we impose on the $\ell^1$ norm of the objective weights. The proof technique of this upper bound highlights how the difficulty of approximating at a $\mathrm{poly}(d, \epsilon^{-1})$ rate stems from the high energy of the second layer, which requires the shallow network used for approximation to have a (potentially) exponential (in $d$) number of directions. Notice that the lower bound in Theorem 5 actually shows that the function is not fixed-threshold approximable. High oscillations in the lower bound (4) essentially ensure that an exponential (in $d$) number of neurons is necessary. An open question is then whether a low-decaying measure is, in general, necessary for such a result to hold.

Expanding on the proof technique above, it is possible to extend the result of Theorem 11 to the approximation of $L$-hidden-layer networks by shallow ones, which gives a rate scaling as $\exp\big( O\big( \epsilon^{-L} \log(p/\epsilon) \big) \big)$.

Theorem 12 Let $f^{(d)}$ be as in (1), with $O(1)$-Lipschitz activations, first hidden layer width $d_1 = p_d$, depth $L_d = L$ and bounded weights, that is, $m_1(f^{(d)}) = O(1)$. Then for every $\epsilon > 0$ there exists a shallow Fourier network $f_N \in \mathcal{F}^{\sigma_1}_N$ with $N \leq p_d^{O(1 + \epsilon^{-L})}$ such that $\|f^{(d)} - f_N\|_{B^d_{1,2},\infty} \leq \epsilon$.

See section B.6 for a formal statement and its proof. While it has been shown that a generic $O(1)$-Lipschitz function cannot be (computably) represented by neural networks with $N \leq \mathrm{poly}(d)$ units (Vardi et al., 2021), an interesting related follow-up conjecture is whether our result can be generalized to any generic $O(1)$-Lipschitz function which is $\mathrm{poly}(d)$-computable. Notice that this is dependent on the choice of the uniform norm to measure the approximation error. For example, it has been shown that a rate $N \leq \mathrm{poly}(d)$ is achievable for approximation in the $L^2$ norm with the uniform measure (Hsu et al., 2021). Finally, notice that the approximation rates shown in Theorem 11 and Theorem 12 are actually polynomial in the size $p_d$ of the first hidden layer of $f^{(d)}$, rather than in the input dimension $d$. However, up to choosing a worse (yet constant) exponent in $\epsilon$, we can replace $p_d$ by $d$ in the statements, by considering the function as an $(L+1)$-hidden-layer network where the first layer is the identity.

4.1 Two cases of interest

Theorem 11 allows us to recover, for any fixed threshold $\epsilon > 0$, a $\mathrm{poly}(d)$ rate for the approximation of $f_{r,w,v}$ by one-hidden-layer networks, and it can be seen as a generalization of Theorem 1 in (Safran et al., 2019). This is the content of the following corollaries.

Corollary 13 (Radial functions) Let $f^{(d)}(x) = \phi_d(\|x\|_2)$, where $\phi_d : [-1,1] \to \mathbb{R}$ is 1-Lipschitz, and $K_d = B^d_{1,2}$. Then, for any $\epsilon \in (0,1)$ it holds that $\inf_{f^\sigma_N \in \mathcal{F}^\sigma_N} \|f^\sigma_N - f^{(d)}\|_{K_d,\infty} \leq \epsilon$ for some $N \leq \exp\big( O\big( \epsilon^{-5} \log(d/\epsilon) \big) \big)$. Moreover, $f^\sigma_N$ can be chosen so that $m_\infty(f^\sigma_N) \leq \exp\big( O\big( \epsilon^{-5} \log(d/\epsilon) \big) \big)$.

Consider the functions $f^{(d)} : x \in \mathbb{R}^d \mapsto e^{i w_d^T (U_d x)_+}$ for some $w_d \in \mathbb{R}^{p_d}$, $U_d \in \mathbb{R}^{p_d \times d}$. This is a more general version of the function $f_{r,w,v}$ considered in section 3. If the weights are bounded, that is, $m_1(f^{(d)}) = O(1)$, then Theorem 11 implies the following.

Corollary 14 (Shallow approximation of (2)) If $r_d = O(1)$ and $K_d = B^d_{r_d,2}$, then for any $\epsilon \in (0,1)$ it holds that $\inf_{f^\sigma_N \in \mathcal{F}^\sigma_N} \|f^\sigma_N - f^{(d)}\|_{K_d,\infty} \leq \epsilon$ for some $N \leq \exp\big( O\big( \epsilon^{-2} \log(p_d/\epsilon) \big) \big)$. Moreover, $f^\sigma_N$ can be chosen so that $m_\infty(f^\sigma_N) \leq \exp\big( O\big( \epsilon^{-2} \log(p_d/\epsilon) \big) \big)$.
Although the result of Corollary 14 is established for approximation in the uniform norm over the unit ball, it is not difficult to extend it to a result in $L^2$ over a measure that is concentrated over a compact set of constant (in $d$) radius, such as a normalized Gaussian. A formal statement of this fact, along with the proof, is reported in section B.5. Compared with the result of section 3, Corollary 14 implies the following. The function $f^{(d)}$ can be approximated at a $\mathrm{poly}(d)$ rate over a compact set of constant radius if its weights $w_d, U_d$ are uniformly bounded. On the other hand, if the norm of the weights grows polynomially in $d$, then approximation at a $\mathrm{poly}(d)$ rate is not possible under a polynomially slowly decaying measure. An open question is whether approximation at a $\mathrm{poly}(d)$ rate is possible if only one of these two conditions holds, that is, if either (1) the norm of the weights is constant but the measure is polynomially slowly decaying, or (2) the measure is concentrated over a compact set of constant radius but the norm of the weights grows (at least) polynomially.

5. Approximation by shallow networks: a spherical harmonics analysis

As already discussed, difficulties in approximating functions in high dimension by shallow networks appear when the function has a Fourier transform spread over an (exponential) number of directions at (polynomially) high energies. On the other hand, the presence of only one of these two conditions is not enough to prevent efficient approximability. While the previous results highlight this, the lower bound presented in Theorem 5 applies to a specific choice of error measure, with (polynomially) slowly decaying tails.

In this section, we aim to disentangle the role of the tails of the measure and to understand how the Fourier representation can tell whether a function is efficiently approximable by a one-hidden-layer network or not. In particular, we focus on approximation results for functions defined over the $(d-1)$-dimensional sphere $S^{d-1}$, for which a rich literature of Fourier analysis is available. First, we give a sufficient condition on the target function, in terms of its spherical harmonics decomposition, for it not to be efficiently approximable by one-hidden-layer networks. This condition captures a slowly decaying and sufficiently spread spherical harmonics expansion. We also show that certain symmetry properties imply this condition. On the other hand, one may ask if a reverse statement holds. In this direction, building on existing theory, we provide a sufficient condition for approximation by one-hidden-layer networks.

5.1 Spherical harmonics decomposition

Let $d \geq 2$ and let $\tau_{S^{d-1}}$ ($\tau_S$ when the dimension is clear from the context) denote the uniform probability measure over $S^{d-1}$. The spherical harmonics form a particular orthonormal basis for $L^2(\tau_S)$. They consist of

$\bigoplus_{k=0}^\infty \mathrm{span}\big\{ Y^d_{k,i} \big\}_{i=1}^{N^d_k} = \bigoplus_{k=0}^\infty H^d_k$,

where $Y^d_{k,i}$ is the restriction to $S^{d-1}$ of a homogeneous harmonic polynomial of degree $k$. The projection operator onto $H^d_k$ is given by

$P^d_k : f \in L^2(\tau_S) \mapsto f_k := \sum_{i=1}^{N^d_k} \langle f, Y^d_{k,i} \rangle\, Y^d_{k,i}$.

Similarly, $P_I$ denotes the operator $\sum_{i \in I} P^d_i$, for any $I \subseteq \mathbb{N}$. The function $f_k$ is referred to as the degree-$k$ spherical harmonic component of the function $f$. Since the spherical harmonics form an orthonormal basis of $L^2(\tau_S)$, it holds that $f = \sum_{k=0}^\infty f_k$ and $\|f\|_2^2 = \sum_{k=0}^\infty \|f_k\|_2^2$ for every $f \in L^2(\tau_S)$, where $\|\cdot\|_2$ denotes the norm in $L^2(\tau_S)$.
As the spherical harmonics decomposition can be seen as a generalization of Fourier series to dimensions $d \geq 3$, in the following we refer to the spherical harmonics decomposition of a function as its Fourier representation, interchangeably. The operator $P^d_k$ can be associated with a kernel, given by

$\sum_{i=1}^{N^d_k} Y^d_{k,i}(x)\, Y^d_{k,i}(y) = N^d_k\, P^d_k(x^T y)$,

where

$N^d_k = \frac{(2k + d - 2)(k + d - 3)!}{k!\,(d-2)!}$

(which, by Stirling's formula, grows as $\frac{(k+d)^{k+d}}{k^k\, d^d}$ up to factors polynomial in $k$ and $d$) is the dimension of $H^d_k$, and $P^d_k$ is the $\frac{d-2}{2}$-Gegenbauer polynomial, defined as

$P^d_k(t) = k!\, \Gamma\Big( \frac{d-1}{2} \Big) \sum_{j=0}^{\lfloor k/2 \rfloor} \frac{(-1)^j\, (1-t^2)^j\, t^{k-2j}}{4^j\, j!\, (k-2j)!\, \Gamma\big( j + \frac{d-1}{2} \big)}$.

Let $\omega_d = \frac{2\pi^{d/2}}{\Gamma(d/2)}$ denote the Lebesgue surface area of $S^{d-1}$. The polynomials $\{(N^d_k)^{1/2} P^d_k\}_{k \geq 0}$ form a basis of orthonormal polynomials for $L^2(\mu_d)$, where $\mu_d$ is the probability measure on $[-1,1]$ defined by

$d\mu_d(t) = \alpha_d\, (1-t^2)^{(d-3)/2}\, dt$, where $\alpha_d = \omega_{d-1}/\omega_d = \Theta(\sqrt d)$.

Notice that, given a function $f \in L^2(\tau_S)$, it holds that

$f_k(x) = N^d_k \int_{S^{d-1}} f(y)\, P^d_k(x^T y)\, d\tau_S(y)$.

Moreover, if the function only depends on a linear projection of the input, the Funk-Hecke formula holds.

Theorem 15 (Funk-Hecke formula) For every $\sigma : [-1,1] \to \mathbb{C}$ such that $x \in S^{d-1} \mapsto \sigma(x_1)$ is in $L^2(\tau_S)$, and for every $w \in S^{d-1}$, it holds that

$\int_{S^{d-1}} \sigma(w^T x)\, P^d_k(\xi^T x)\, d\tau_S(x) = \lambda_k\, P^d_k(\xi^T w)$, where $\lambda_k = \langle \sigma, P^d_k \rangle_{\mu_d}$.

Functions of the form $x \in S^{d-1} \mapsto \alpha P^d_k(w^T x)$, for some $\alpha \in \mathbb{R}$ and $w \in S^{d-1}$, are called zonal harmonics. By the Funk-Hecke formula it follows that

$\int_{S^{d-1}} P^d_k(w^T x)\, P^d_k(v^T x)\, d\tau_S(x) = (N^d_k)^{-1}\, P^d_k(w^T v)$

for any $w, v \in S^{d-1}$. This implies that $H^d_k$ has an RKHS structure, with kernel $K$ given by $K(v,w) := N^d_k\, P^d_k(v^T w)$. In particular, zonal harmonics actually span $H^d_k$. Moreover, it can be shown that there exist $w_1, \dots, w_{N^d_k} \in S^{d-1}$ such that $H^d_k = \mathrm{span}\big( \{ P^d_k(w_i^T \cdot) \}_{i=1}^{N^d_k} \big)$ (Efthimiou and Frye, 2014, Theorem 4.13). For these facts and more details about spherical harmonics we refer to the books (Atkinson and Han, 2012; Dai and Xu, 2013).

5.2 Concentration and spreadness in $H^d_k$ and main results

Intuitively, one can say that a function $f \in C(S^{d-1})$ is concentrated over $S^{d-1}$ if there is a region $\Omega \subseteq S^{d-1}$ such that the mass of $f$ is concentrated over $\Omega$. On the other hand, one could say that $f$ is spread if it assumes non-negligible values uniformly over the sphere. The spreadness/concentration of the function $f$ can be quantified by looking at ratios of the type

$\ell_{q,p}(f) := \frac{\|f\|_q}{\|f\|_p}$ for $1 \leq p < q \leq \infty$.

Since the norms above are with respect to a probability measure, it holds that $\ell_{q,p} \geq 1$. Intuitively, the closer this ratio is to 1, the more spread the function is; the larger this ratio, the more concentrated the function is. Consider the case of a function $f_k \in H^d_k$. Then, it holds that

$\ell_{\infty,2}(f_k) \leq \sqrt{N^d_k}$.

The equality is attained by functions of the type $f_k(x) = \alpha P^d_k(w^T x)$ for some $\alpha \in \mathbb{C}$ and $w \in S^{d-1}$, i.e. zonal harmonics. In this sense, zonal harmonics can be considered as the most concentrated functions in $H^d_k$. A similar inequality can be shown for the quantity $\ell_{2,1}$: it holds that

$\ell_{2,1}(f_k) \leq \sqrt{N^d_k}$ (6)

for $f_k \in H^d_k$. Nevertheless, in this case zonal harmonics do not attain equality; the inequality is actually not tight. A more detailed discussion of this quantity is reported in section 5.4. Thanks to the Funk-Hecke formula, it holds that a one-hidden-layer network $f_N \in \mathcal{F}_N$ with hidden-layer weights $w_1, \dots, w_N$ satisfies

$(f_N)_k(x) = \sum_{j=1}^N \alpha_j\, P^d_k(w_j^T x)$

for some $\alpha \in \mathbb{C}^N$. In other words, its Fourier representation is concentrated along $N$ directions.
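As a numerical illustration of the concentration of zonal harmonics, here is a sketch that computes $N^d_k$ from the factorial formula and estimates $\ell_{\infty,2}$ of a zonal harmonic by Monte Carlo; since the sup norm of the normalized zonal harmonic is 1 (attained at $x = \pm w$), the ratio should approach $\sqrt{N^d_k}$.

```python
import numpy as np
from math import comb
from scipy.special import gegenbauer

def N_kd(k, d):
    """dim H_k^d = (2k + d - 2)(k + d - 3)! / (k! (d - 2)!), for d >= 3."""
    return (2 * k + d - 2) * comb(k + d - 3, k) // (d - 2)

d, k, n = 8, 6, 400_000
C = gegenbauer(k, (d - 2) / 2.0)
P = lambda t: C(t) / C(1.0)                    # normalized so that P(1) = 1
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)  # uniform samples on S^{d-1}
f = P(x[:, 0])                                 # zonal harmonic with w = e_1
# ||f||_inf = 1 (at x = +-w), so l_{inf,2}(f) = 1 / ||f||_2 ~ sqrt(N_kd)
print(1.0 / np.sqrt(np.mean(f**2)), np.sqrt(N_kd(k, d)))
```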
According to the remarks above, this implies that if the width $N$ is relatively small, the Fourier components of the neural network $f_N$ are relatively concentrated in space. One would then expect that such concentration can be used to determine whether a function can be approximated efficiently by a one-hidden-layer neural network or not. In the next sections, we show that this is indeed the case. Let $f \in C(S^{d-1})$; assuming that $\|f^{(d)}_k\|_2 \geq \mathrm{poly}(d^{-1}, k^{-1})$, the results can be informally summarized as follows:

If the spherical components of $f$ are (exponentially) spread in the $\ell_{\infty,2}$ sense, that is, for example,

$\ell_{\infty,2}(f_k) \leq \epsilon^k \sqrt{N^d_k} = \epsilon^k \sup_{g \in H^d_k} \ell_{\infty,2}(g)$

for some $\epsilon \in (0,1)$, then $f$ is provably not universally approximable by one-hidden-layer networks.

If the spherical components of $f$ are (polynomially) concentrated in the $\ell_{2,1}$ sense, that is, for example,

$\ell_{2,1}(f_k) \geq \mathrm{poly}(d^{-1}, k^{-1}) \sqrt{N^d_k}$,

then $f$ is universally approximable by one-hidden-layer networks.

Notice that, on the other hand, if $\|f_k\|_2$ decreases exponentially fast, then universal approximation follows; similarly, if $\|f_k\|_2$ decreases more slowly than any power of $k^{-1}$, then universal approximation cannot hold. The first of the two conditions above expresses spreadness of the Fourier decomposition, while the second expresses concentration of the same. We notice at least two gaps between the two conditions. The first one is the expression of the concentration phenomena: one is with respect to $\ell_{\infty,2}$, while the other one is with respect to $\ell_{2,1}$. Second, the two regimes above do not cover many other possible ones. For example, we suspect the existence of a regime which prevents universal approximability but allows for fixed-threshold approximability, a topic worthy of future study. These results are properly formalized, stated and discussed in section 5.3 and section 5.4, respectively.

5.3 Inapproximability of functions with spread Fourier representation

As discussed above, one-hidden-layer functions have a zonal structure. In more detail, if $h(x) = \sigma(w^T x + b)$ for some $w \in S^{d-1}$ and $b \in \mathbb{R}$, then it is easy to see that

$h_k(x) = s_k\, \|h_k\|_2\, \sqrt{N^d_k}\, P^d_k(w^T x)$

with $s_k \in \{\pm 1\}$. In particular, it follows that $\|h_k\|_\infty = |h_k(\pm w)| = (N^d_k)^{1/2}\, \|h_k\|_2$. This can be interpreted by saying that the Fourier components of single neurons are maximally concentrated (along the neuron direction) in space. Therefore, it is natural to expect that functions with spread Fourier decomposition are difficult to approximate by shallow neural networks. The proposition below formalizes this fact. The proof follows a technique similar to the one used in (Daniely, 2017) (see Remark 17 for a comparison) and essentially upper bounds the scalar product between the objective function and the network.

Proposition 16 Let $\{f^{(d)}\}_d$ be a sequence of functions such that $f^{(d)} \in C(S^{d-1})$, and let $M > 0$. Assume that for every $d$ there exists $I_d \subseteq \mathbb{N}$ such that:

1. it holds that $\|f^{(d)}\|_2 \leq O(d^M)\, \|P_{I_d} f^{(d)}\|_2$;
2. there exists a non-negative sequence $\{c_{d,k}\}_{k \in I_d}$ such that $\|f^{(d)}_k\|_\infty \leq c_{d,k}\, \sqrt{N^d_k}\, \|f^{(d)}\|_2$ for all $k \in I_d$, and such that $\big( \sum_{k \in I_d} c_{d,k}^2 \big)^{1/2} \leq \epsilon^{d^\alpha} O(d^M)$ for some $\epsilon \in (0,1)$ and $\alpha > 0$.

Moreover, assume that $\|f^{(d)}\|_\infty = O(1)$ and $\|f^{(d)}\|_2 = \Omega(d^{-M})$. Then the sequence $\{f^{(d)}\}_{d > 2}$ is not universally approximable by one-hidden-layer networks.

The proof of Proposition 16 is reported in section C.1. We discuss a few particular cases where the assumptions of Proposition 16 hold. Let $\{f^{(d)}\}_{d > 2}$ be a sequence of functions $f^{(d)} \in C(S^{d-1})$.
Example 2 (Constant control on $\ell_{\infty,2}$) Assume that assumption 1 in Proposition 16 holds with $I_d = \{k \in \mathbb{N} : k \geq d^2\}$ and that $\|f^{(d)}\|_2 = \Omega(d^{-M})$ for some constant $M > 0$. If it holds that $\ell_{\infty,2}(f^{(d)}_k) \leq \ell$ for all $k \geq d^2$, for some constant $\ell \geq 1$, then it is easy to check that Proposition 16 applies. This condition can be thought of as the spherical harmonic components of the function $f^{(d)}$ being uniformly spread at high frequencies ($k \geq d^2$). Indeed, assumption 2 holds with $c_{d,k} := \ell\, (N^d_k)^{-1/2}\, \|f^{(d)}_k\|_2 / \|f^{(d)}\|_2$, since

$\sum_{k = d^2}^\infty c_{d,k}^2 \leq \ell^2 / N^d_{d^2}$,

which is super-exponentially small in $d$. This is similar to the condition used in (Daniely, 2017), discussed in the remark below.

Remark 17 Daniely (Daniely, 2017) showed a depth-separation result using a result similar to Proposition 16. The difference in this case is that the author considers functions defined on $S^{d-1} \times S^{d-1}$. However, since $L^2(S^{d-1} \times S^{d-1}) = L^2(S^{d-1}) \otimes L^2(S^{d-1})$, the space $L^2(S^{d-1} \times S^{d-1})$ admits a decomposition in spherical harmonics:

$L^2(S^{d-1} \times S^{d-1}) = \bigoplus_{j,k=0}^\infty H^d_j \otimes H^d_k$.

In particular, Daniely considers functions of the type $f^{(d)} : (x,y) \in S^{d-1} \times S^{d-1} \mapsto h^{(d)}(x^T y)$ for some $h^{(d)} \in C([-1,1])$. Such functions belong to $\bigoplus_{k=0}^\infty H^d_k \otimes H^d_k$ and satisfy

$\ell_{\infty,2}(f^{(d)}_{k,k}) \leq (N^d_k)^{1/2} = (N^d_k)^{-1/2}\, \ell^\infty_{k,k}$,

where $\ell^\infty_{k,k} = \max_{f \in H^d_k \otimes H^d_k} \ell_{\infty,2}(f) = N^d_k$. The inequality above resembles condition 2 in Proposition 16, since it implies that

$\|f^{(d)}_{k,k}\|_\infty \leq c_{d,k}\, \ell^\infty_{k,k}\, \|f^{(d)}\|_2$ with $c_{d,k} := (N^d_k)^{-1/2}\, \frac{\|f^{(d)}_{k,k}\|_2}{\|f^{(d)}\|_2}$,

which, for $k_d \geq d^2$, implies that $c_d \leq d^3\, 2^{-d}$. The proof is then concluded by choosing $I_d = \{(k,k) : k \geq k_d\}$, since (using the same notation as in the proof of Proposition 16) it holds that

$\|f_N - f^{(d)}\|_2^2 \geq \|P_{I_d} f^{(d)}\|_2^2 - 2 \sum_{i,j} (\ell^\infty_{j,j})^{-1}\, |u_i|\, \|f^{(d)}_{j,j}\|_\infty\, \|f^{\sigma_i,w_i}_{j,j}\|_2$,

which is the analogue of formula (44).

Example 3 Assume that assumption 1 in Proposition 16 holds with $I_d = \{k \in \mathbb{N} : k \geq \rho d^\beta\}$ for some $\rho > 0$, $\beta > 0$, and that $\|f^{(d)}\|_2 = \Omega(d^{-M})$ for some constant $M > 0$. If it holds that

$\ell_{\infty,2}(f^{(d)}_k) \leq \epsilon^k\, O(d^M)\, \sqrt{N^d_k}$

for all $k \geq \rho d^\beta$, for some constant $M > 0$, then Proposition 16 applies, since

$\sum_{k = \rho d^\beta}^\infty \epsilon^k = \frac{\epsilon^{\rho d^\beta}}{1 - \epsilon}$.

This condition can also be thought of as the spherical harmonic components of the function $f^{(d)}$ being uniformly spread at high frequencies ($k \geq \rho d^\beta$), although in this case the spreadness is required to increase exponentially with the degree, relative to the maximum concentration achievable (that is, $(N^d_k)^{1/2}$).

Example 4 (Invariant functions) Finally, we show that certain symmetry assumptions can imply energy spreadness. Consider the case of a sign-invariant function $f \in C(S^{d-1})$, that is, one such that $f(\epsilon \odot x) = f(x)$ for every $\epsilon \in \{\pm 1\}^d$ and $x \in S^{d-1}$ (where $\odot$ denotes the element-wise product).

Lemma 18 Let $f \in C(S^{d-1})$ be a sign-invariant function. If

$\|f_k\|_\infty = \sup_{\epsilon \in \{\pm 1\}^d} |f_k(\epsilon / \sqrt d)|$ (7)

for some $k \geq 16 d^2$, then it holds that $\|f_k\|_\infty \leq 2 \cdot 2^{-d/2}\, \sqrt{N^d_k}\, \|f_k\|_2$.

Proof Notice that since $f$ is sign-invariant, so is $f_k$. Consider the function

$P : x \in S^{d-1} \mapsto 2^{-d}\, N^d_k \sum_{\epsilon \in \{\pm 1\}^d} P^d_k(\epsilon^T x / \sqrt d)$.

The function $P$ satisfies $\|P\|_2 \leq 2 \cdot 2^{-d/2}\, \sqrt{N^d_k}$ (see Lemma 46). Let $\epsilon \in \{\pm 1\}^d$ attain the supremum in (7). Then it holds that

$\|f_k\|_\infty = |f_k(\epsilon/\sqrt d)| = |\langle f_k, P \rangle| \leq \|P\|_2\, \|f_k\|_2 \leq 2 \cdot 2^{-d/2}\, \sqrt{N^d_k}\, \|f_k\|_2$.

This concludes the proof.

The statement of the above lemma therefore says that if $f$ is sign-invariant and achieves maximum energy at a specific frequency, then it satisfies assumption 2 of Proposition 16. Under polynomial decay of $\|f_k\|_2$, it should be possible to relax condition (7) to only require the point $w^{(k)} \in [0, \infty)^d$ at which $\|f_k\|_\infty = |f_k(w^{(k)})|$ to satisfy $|w^{(k)}_j| \geq \mathrm{poly}(d^{-1})$.
5.4 Efficient approximation under a sparsity condition on the spherical harmonics decomposition

Works by Barron (Barron, 1993; Klusowski and Barron, 2018) essentially show that efficient approximation holds under a sparsity condition on the Fourier transform of the function to approximate; more specifically, for $f \in L^1(\mathbb{R}^d)$, the rate of (uniform) approximation is controlled by the quantity $\int_{\mathbb{R}^d} \|w\|_1^2\, |\hat f(w)|\, dw$. In this section we show that an equivalent control can be determined for approximation on the sphere, in terms of the spherical harmonics decomposition. For technical reasons, the result is established for functions in $\hat H^d := H^d_1 \oplus \bigoplus_{k \geq 0} H^d_{2k}$ (which corresponds to the space of functions in $L^2(\tau_S)$ whose odd part is linear), and mainly for the ReLU activation. We briefly discuss extensions to different activation functions in Remark 24. Consider the space of homogeneous one-hidden-layer neural networks with ReLU activation:

$\mathcal{F}^{\mathrm{ReLU},0}_N = \Big\{ f : x \in S^{d-1} \mapsto \sum_{k=1}^N u_k\, (w_k^T x)_+ \ :\ u \in \mathbb{R}^N,\ w_k \in S^{d-1} \Big\}$.

Since $(w^T x)_+ = \frac{1}{2}\big( w^T x + |w^T x| \big)$, every function in $\mathcal{F}^{\mathrm{ReLU},0}_N$ is the sum of a linear function and an even one. In other words, $\mathcal{F}^{\mathrm{ReLU},0}_N \subseteq \hat H^d$. Since any linear function belongs to $\mathcal{F}^{\mathrm{ReLU},0}_2$, it is equivalent to consider the problem of approximating even functions by homogeneous one-hidden-layer neural networks with activation $\mathrm{abs}(x) = |x|$, that is, elements of the space

$\mathcal{F}^{\mathrm{abs},0}_N = \Big\{ f : x \in S^{d-1} \mapsto \sum_{k=1}^N u_k\, |w_k^T x| \ :\ u \in \mathbb{R}^N,\ w_k \in S^{d-1} \Big\}$.

To study this, consider the corresponding functional space

$\mathcal{H}_1 := \{ h_\pi : \pi \text{ is a signed even Radon measure} \}$,

where $h_\pi$ is defined to be the function

$h_\pi : x \in S^{d-1} \mapsto \int_{S^{d-1}} |w^T x|\, d\pi(w)$.

The space $\mathcal{H}_1$ is a Banach space endowed with the norm $\gamma_1(h) = \inf_{\pi : h = h_\pi} \|\pi\|_1$. As discussed in the introduction, the space $\mathcal{H}_1$ consists of functions which are efficiently approximable by one-hidden-layer networks. More formally, the following holds.

Theorem 19 (Bourgain et al. (1989)) Let $f \in \mathcal{H}_1$. Then it holds that

$\inf_{f_N \in \mathcal{F}^{\mathrm{abs},0}_N} \|f - f_N\|_\infty \leq c\, \gamma_1(f)\, \sqrt{d/N}$,

where $c > 0$ is a numerical constant. Moreover, $f_N$ satisfying the bound can be chosen to satisfy $\gamma_1(f_N) \leq \gamma_1(f)$.

The question of interest can now be transposed to: which functions $f \in C(S^{d-1})$ have a (polynomially) small norm $\gamma_1(f)$? One way to approach this problem is via the so-called Blaschke-Levy operator. Consider the transformation

$T : \phi \mapsto \int_{S^{d-1}} |x^T y|\, \phi(y)\, d\tau_S(y)$

for functions $\phi \in C(S^{d-1})$. $T$ can be described in terms of spherical harmonics (Rubin, 1998) as

$T\phi = \sum_{k \geq 0 \text{ even}} \sigma_k\, \phi_k$, where $\sigma_k = (-1)^{1+k/2}\, \frac{\Gamma\big( \frac{k-1}{2} \big)\, \Gamma\big( \frac{d}{2} \big)}{2\pi\, \Gamma\big( \frac{k+d+1}{2} \big)}$.

In particular, it holds that the functional $T$ is an automorphism of $C^\infty_{\mathrm{even}}(S^{d-1})$ (the set of even functions in $C^\infty(S^{d-1})$) (Rubin, 1998). Clearly, its inverse can be defined in terms of spherical harmonics by

$T^{-1} : \phi \in C^\infty_{\mathrm{even}}(S^{d-1}) \mapsto \sum_{k \geq 0 \text{ even}} \sigma_k^{-1}\, \phi_k$.

The following is immediate.

Proposition 20 For any $\phi \in C^\infty_{\mathrm{even}}(S^{d-1})$ it holds that $\phi \in \mathcal{H}_1$ and $\gamma_1(\phi) = \|T^{-1}\phi\|_1$.

Using these results, we can proceed similarly to the work (Ongie et al., 2019) and obtain the following.

Proposition 21 Let $f \in C(S^{d-1})$ be even. It holds that $f \in \mathcal{H}_1$ if and only if

$\sup_{\phi \in C^\infty_{\mathrm{even}}(S^{d-1})\,:\, \|\phi\|_\infty \leq 1} \langle T^{-1}\phi, f \rangle < \infty$. (8)

In this case, $\gamma_1(f) = \sup_{\phi \in C^\infty_{\mathrm{even}}(S^{d-1})\,:\, \|\phi\|_\infty \leq 1} \langle T^{-1}\phi, f \rangle$.

The proof of Proposition 21 is reported in section C.3. Functions that satisfy equation (8) include all even functions in $C^{d+2}(S^{d-1})$ if $d$ is even, and all even functions in $C^{d+3}(S^{d-1})$ if $d$ is odd (Weil, 1976). This is in line with existing results that show approximability by neural networks for functions whose regularity is proportional to the dimension $d$ (e.g. (Maiorov and Meir, 2000)).
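As a quick numerical check of the multiplier identity $T\, P^d_k(w^T \cdot) = \sigma_k\, P^d_k(w^T \cdot)$, here is a Monte Carlo sketch for $d = 3$, $k = 2$, where the value $\sigma_2 = 1/8$ can be computed by hand under the normalization used here (uniform probability measure on the sphere); the test parameters are illustrative.

```python
import numpy as np
from scipy.special import gegenbauer

rng = np.random.default_rng(0)
d, k, n = 3, 2, 2_000_000
C = gegenbauer(k, (d - 2) / 2.0)          # Legendre P_2 for d = 3
P = lambda t: C(t) / C(1.0)               # normalized so that P(1) = 1
w = np.array([0.0, 0.0, 1.0])
y = rng.standard_normal((n, d))
y /= np.linalg.norm(y, axis=1, keepdims=True)   # uniform samples on S^2
# (T P_k(w^T .))(w) = E_y[|w^T y| P_k(w^T y)] should equal sigma_k * P_k(1)
print(np.mean(np.abs(y @ w) * P(y @ w)))        # ~ 0.125 = 1/8
```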
Given $f \in C(S^{d-1})$ even, the condition of Proposition 21 is implied by the (weak) convergence (as $N \to \infty$) of the series

$\sum_{k=0}^N \sigma_{2k}^{-1}\, f_{2k}$

to a finite signed measure $\pi$. In this case $f = h_\pi$. In particular, a stronger condition is convergence in $L^1(\tau_S)$. This is implied if it holds that

$\sum_{k \geq 0 \text{ even}} |\sigma_k|^{-1}\, \|f_k\|_1 < \infty$. (9)

Notice that, instead, the series converges in $L^2(\tau_S)$ if and only if

$\sum_{k \geq 0 \text{ even}} \sigma_k^{-2}\, \|f_k\|_2^2 < \infty$.

This is equivalent to asking that $f \in \mathcal{H}_2$, the RKHS given by the kernel function

$k : (x,y) \in S^{d-1} \times S^{d-1} \mapsto \int_{S^{d-1}} |x^T w|\, |w^T y|\, d\tau_S(w)$.

Since in this case $\mathcal{H}_2$ can be described as

$\mathcal{H}_2 := \{ h_\pi : \pi \text{ is a signed even Radon measure with an } L^2(\tau_S) \text{ density} \}$,

it is clear that $\mathcal{H}_2 \subseteq \mathcal{H}_1$. We refer to (Bach, 2017) for more details about these statements. On the other hand, notice that the condition (9) is potentially much stronger than simply asking that $f \in \mathcal{H}_1$.

Example 5 (Highly concentrated functions) Some computations show that

$|\sigma_k|^{-1} = \Theta\big( d^{3/4}\, k^2\, \sqrt{N^d_k} \big)$. (10)

Using these observations, it is then straightforward to prove the following.

Proposition 22 Let $\{f^{(d)}\}_d$ be a sequence of even functions in $C(S^{d-1})$. Assume that there exist some constants $M, N > 0$ such that

$\sqrt{N^d_k}\, \|f^{(d)}_k\|_1 \leq O(k^M d^N)\, \|f^{(d)}_k\|_2$ and $\sum_{k=0}^\infty k^{M+2}\, \|f^{(d)}_k\|_2 = O(d^N)$.

Then the sequence $\{f^{(d)}\}_{d \geq 2}$ is universally approximable by the spaces $\mathcal{F}^{\mathrm{abs},0}_N$.

Proof By Proposition 20 and equation (10) above we get that

$\gamma_1(f^{(d)}) \leq \sum_{k \geq 0 \text{ even}} |\sigma_k|^{-1}\, \|f^{(d)}_k\|_1 \leq \Theta(d^{N + 3/4}) \sum_{k \geq 0 \text{ even}} k^{2+M}\, \|f^{(d)}_k\|_2 \leq O(d^{3/4 + 2N})$.

The application of Theorem 19 concludes the proof.

The proposition above essentially requires two conditions to hold. The first is that the energy of the functions decreases fast enough (yet polynomially in $k$ and $d$). The second condition is that the Fourier components of the function are concentrated enough, that is, they are polynomially close to the bound (6). We remark that this condition is in fact quite strong: it requires the function $f$ to be band-limited. Indeed, according to (Dai et al., 2016), it holds that

$\|f^{(d)}_k\|_2 \leq C(d)\, k^{\frac{d-2}{4}}\, \|f^{(d)}_k\|_1$

for some function $C(d)$. Then any $f^{(d)}$ satisfying the first condition of Proposition 22 would satisfy

$\sqrt{N^d_k} \leq \mathrm{poly}(k,d)\, \frac{\|f^{(d)}_k\|_2}{\|f^{(d)}_k\|_1} \leq \mathrm{poly}(k,d)\, k^{\frac{d-2}{4}}$;

since $\sqrt{N^d_k} \geq c(d)\, k^{\frac{d-2}{2}}$ for some $c(d)$, this implies that $k^{\frac{d-2}{4}} \leq \mathrm{poly}(k)\, H(d)$ for some function $H(d)$. It follows that $k$ must satisfy $k \leq K(d)$ for some $K(d)$. However, the rate of the function $K(d)$ does not follow from (Dai et al., 2016); we conjecture that $K(d)$ behaves as a power of $d$.

Example 6 (High-energy zonal harmonics) The properties discussed in this section indicate that high energy alone does not yield non-universal-approximability. As an extreme case, consider a zonal harmonic $f(x) := P^d_k(w^T x)$, for $x, w \in S^{d-1}$, where $w$ is fixed and $k$ is even. Notice that $\|f\|_\infty = 1$. It holds that

$\gamma_1(f) = |\sigma_k|^{-1}\, \|f_k\|_1 \leq O(k^2 d^{3/4})\, \sqrt{N^d_k}\, \|f_k\|_2 = O(k^2 d^{3/4})$,

which implies universal approximability by Theorem 19. Similarly, polynomial combinations of zonal harmonics can be well approximated, as expected.

Remark 23 (Ridge functions) For a single-neuron network $f(x) = |w^T x|$, it holds that $\|f\|_\infty = 1$ and $\|f\|_2 = d^{-1/2}$. The spherical components of $f$ are given by

$f_k(x) = N^d_k \int_{S^{d-1}} |w^T y|\, P^d_k(x^T y)\, d\tau_S(y) = \sigma_k\, N^d_k\, P^d_k(w^T x)$.

In particular, it holds that

$1 = \gamma_1(f) = \Big\| \sum_{k \geq 0 \text{ even}} \sigma_k^{-1}\, f_k \Big\|_1 = \Big\| \sum_{k \geq 0 \text{ even}} N^d_k\, P^d_k(w^T \cdot) \Big\|_1$,

where the last series is meant in the distributional sense. Therefore, understanding how tight (or strong) condition (9) is is highly correlated with understanding the convergence of the series $\sum_{k \geq 0 \text{ even}} \|N^d_k\, P^d_k(w^T \cdot)\|_1$, or, equivalently, computing $\|P^d_k\|_{\mu_d,1}$.
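Remark 23 reduces part of the question to the multipliers $\sigma_k$ and the norms $\|P^d_k\|_{\mu_d,1}$. Here is a small numerical sketch of both; the closed form used for $\sigma_k$ is the reconstruction above (with $2\pi$ in the denominator, consistent with the probability normalization), cross-checked against the hand-computable value $\sigma_2 = 1/8$ for $d = 3$.

```python
import numpy as np
from scipy.special import gegenbauer, gammaln
from scipy.integrate import quad

def sigma(k, d):
    """Blaschke-Levy multiplier sigma_k on H_k^d, for even k >= 2."""
    sign = (-1.0) ** (1 + k // 2)
    logmag = gammaln((k - 1) / 2) + gammaln(d / 2) - gammaln((k + d + 1) / 2)
    return sign * np.exp(logmag) / (2 * np.pi)

def zonal_l1(k, d):
    """||P_k^d||_{mu_d,1} by quadrature, d mu_d = alpha_d (1 - t^2)^((d-3)/2) dt."""
    C = gegenbauer(k, (d - 2) / 2.0)
    alpha_d = np.exp(gammaln(d / 2) - gammaln((d - 1) / 2)) / np.sqrt(np.pi)
    integrand = lambda t: abs(C(t) / C(1.0)) * alpha_d * (1 - t * t) ** ((d - 3) / 2.0)
    val, _ = quad(integrand, -1.0, 1.0, limit=200)
    return val

print(sigma(2, 3))                     # 0.125 = 1/8, matching the Monte Carlo check
d = 10
for k in (2, 4, 8, 16):
    print(k, 1.0 / abs(sigma(k, d)), zonal_l1(k, d))   # |sigma_k|^{-1} grows with k
```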
Remark 24 While the results of this section mainly concern approximation by homogeneous one-hidden-layer networks with the ReLU (or absolute value) activation, they can easily be extended to any other activation satisfying Assumption 1, under the same assumptions. Moreover, notice that, thanks to Theorem 19, universal approximation by $\mathcal{F}^{\mathrm{ReLU},0}_N$ is equivalent to universal approximation by $\mathcal{H}_1 \oplus H^d_1$.

References

Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.

Kendall Atkinson and Weimin Han. Spherical harmonics and approximations on the unit sphere: an introduction, volume 2044. Springer Science & Business Media, 2012.

Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629-681, 2017.

Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993.

Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553-1565, 2014.

Jean Bourgain, Joram Lindenstrauss, and Vitali Milman. Approximation of zonoids by zonotopes. Acta Mathematica, 162(1):73-141, 1989.

Guy Bresler and Dheeraj Nagaraj. Sharp representation theorems for ReLU networks with precise dependence on depth. arXiv preprint arXiv:2006.04048, 2020.

Kaifeng Bu, Yaobo Zhang, and Qingxian Luo. Depth-width trade-offs for neural networks via topological entropy. arXiv preprint arXiv:2010.07587, 2020.

John Charles Burkill. Lectures on approximation by polynomials, volume 16. Tata Institute of Fundamental Research, 1959.

Vaggos Chatziafratis, Sai Ganesh Nagarajan, Ioannis Panageas, and Xiao Wang. Depth-width trade-offs for ReLU networks via Sharkovsky's theorem. arXiv preprint arXiv:1912.04378, 2019.

Feng Dai and Yuan Xu. Approximation theory and harmonic analysis on spheres and balls, volume 23. Springer, 2013.

Feng Dai, Han Feng, and Sergey Tikhonov. Reverse Hölder's inequality for spherical harmonics. Proceedings of the American Mathematical Society, 144(3):1041-1051, 2016.

Amit Daniely. Depth separation for neural networks. In Conference on Learning Theory, pages 690-696. PMLR, 2017.

Costas Efthimiou and Christopher Frye. Spherical harmonics in p dimensions. World Scientific, 2014.

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907-940. PMLR, 2016.

Ingo Gühring, Mones Raslan, and Gitta Kutyniok. Expressivity of deep neural networks. arXiv preprint arXiv:2007.04759, 2020.

Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. arXiv preprint arXiv:1901.09021, 2019a.

Boris Hanin and David Rolnick. Deep ReLU networks have surprisingly few activation patterns. In Advances in Neural Information Processing Systems, pages 359-368, 2019b.

Daniel Hsu, Clayton Sanford, Rocco A Servedio, and Emmanouil-Vasileios Vlatakis-Gkaragkounis. On the approximation power of two-layer networks of random ReLUs. arXiv preprint arXiv:2102.02336, 2021.

Shirin Jalali, Carl Nuzman, and Iraj Saniee. Efficient deep learning of GMMs. arXiv preprint arXiv:1902.05707, 2019.

Jason M Klusowski and Andrew R Barron.
Approximation by combinations of ReLU and squared ReLU ridge functions with ℓ1 and ℓ0 controls. IEEE Transactions on Information Theory, 64(12):7649-7656, 2018.

Shiyu Liang and Rayadurgam Srikant. Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161, 2016.

V. E. Maiorov and Ron Meir. On the near optimality of the stochastic approximation of smooth functions by neural networks. Advances in Computational Mathematics, 13(1):79-103, 2000.

Eran Malach and Shai Shalev-Shwartz. Is deeper better only when shallow is good? In Advances in Neural Information Processing Systems, pages 6426-6435, 2019.

Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. The connection between approximation, depth separation and learnability in neural networks. arXiv preprint arXiv:2102.00434, 2021.

Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924-2932, 2014.

Greg Ongie, Rebecca Willett, Daniel Soudry, and Nathan Srebro. A function space view of bounded norm infinite width ReLU nets: The multivariate case. arXiv preprint arXiv:1910.01635, 2019.

Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098, 2013.

Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296-330, 2018.

Philipp Christian Petersen. Neural network theory, 2020. URL http://pc-petersen.eu/Neural_Network_Theory.pdf.

Allan Pinkus. Approximation theory of the MLP model. Acta Numerica, 8:143-195, 1999.

Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5):503-519, 2017.

Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360-3368, 2016.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2847-2854. JMLR.org, 2017.

Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International Conference on Machine Learning, pages 5301-5310. PMLR, 2019.

Donsub Rim, Luca Venturi, Joan Bruna, and Benjamin Peherstorfer. Depth separation for reduced deep networks in nonlinear model reduction: Distilling shock waves in nonlinear hyperbolic problems. arXiv preprint arXiv:2007.13977, 2020.

Theodore J. Rivlin. An Introduction to the Approximation of Functions. Courier Corporation, 1981.

David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. arXiv preprint arXiv:1705.05502, 2017.

Boris Rubin. Inversion of fractional integrals related to the spherical Radon transform. Journal of Functional Analysis, 157(2):470-487, 1998.

Itay Safran and Ohad Shamir. Depth-width tradeoffs in approximating natural functions with neural networks.
In Proceedings of the 34th International Conference on Machine Learning, pages 2979-2987. JMLR.org, 2017.

Itay Safran, Ronen Eldan, and Ohad Shamir. Depth separations in neural networks: What is actually being separated? arXiv preprint arXiv:1904.06984, 2019.

Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pages 1517-1539. PMLR, 2016.

Gal Vardi and Ohad Shamir. Neural networks with small weights and depth-separation barriers. arXiv preprint arXiv:2006.00625, 2020.

Gal Vardi, Daniel Reichman, Toniann Pitassi, and Ohad Shamir. Size and depth separation in approximating natural functions with neural networks. arXiv preprint arXiv:2102.00314, 2021.

Wolfgang Weil. Centrally symmetric convex bodies and distributions. Israel Journal of Mathematics, 24(3):352-367, 1976.

Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103-114, 2017.

Appendix A. Proofs of depth-separation results

A.1 Proof of Theorem 5

The proof of the lower bound follows the same strategy as (Eldan and Shamir, 2016). For sake of simplicity, in the following we drop the dimension $d$ from the notation: $w_d = w$ and $v_d = v$. In the following we always assume $d \ge 3$. Let $S \subseteq [d]$ be a subset and let $I_S$ be the truncated identity matrix defined as $I_S = \sum_{s \in S} e_s e_s^T$. Moreover, define the function $H_S(x)$ as
$$H_S(x) = \prod_{i : i \in S} 1_{x_i > 0} \prod_{j : j \in [d] \setminus S} 1_{x_j \le 0}.$$
Lastly, for a subset $S \subseteq [d]$, let $v_S := v + I_S w$ and define the function $\sigma_{r,S}(x) := \sigma_r(v_S^T x)$. Therefore, the expression of $f_{r_d,w,v}$ can be rewritten as
$$f_{r_d,w,v}(x) = \sum_{S \subseteq [d]} g_S(x) = \sum_{S \subseteq [d]} H_S(x)\,\sigma_{r_d,S}(x),$$
where $g_S(x) := H_S(x)\,\sigma_{r_d,S}(x)$. Let the space of $N$-unit one-hidden-layer networks be
$$\mathcal{F}_N = \Big\{ f_N : x \in \mathbb{R}^d \mapsto \sum_{k=1}^N \sigma_k(a_k^T x) \; : \; a_k \in \mathbb{R}^d, \ \sigma_k \text{ are 1-Lipschitz activations} \Big\}.$$
Assume that

(A1) it holds that $\tau_d\, r_d \ge \beta\, d^k$ for some constants $\beta > 0$ and $k \ge 1$;

(A2) it holds that $\eta > \log_2\big(\|\psi\|_1 \sqrt{2K}\big)$, that is, $2^{1-2\eta} K \|\psi\|_1^2 < 1$.

Then, for large enough $d$, it holds
$$\inf_{f \in \mathcal{F}_N} \|f_{r_d,w,v} - f\|_\phi^2 \ge 1 - N\,\big(2^{1-2\eta} K \|\psi\|_1^2\big)^d\, O(d\, \tau_d\, r_d), \qquad (11)$$
where we denote
$$\|g\|_\phi^2 := \int_{\mathbb{R}^d} |g(x)|^2\, \phi^2(x)\, dx \quad \text{for } g \in L^2_{\phi^2}.$$
In particular, if $N \le \mathrm{poly}(d)$, then the error (11) tends to 1 as $d \to \infty$.

To show equation (11), we proceed as follows. Let $\mathcal{F} = \{\widehat{f\phi} : f \in \mathcal{F}_1\}$, and denote $F := \widehat{\phi\, f_{r_d,w,v}} = \hat f_{r_d,w,v} * \hat\phi$. Since $\hat\phi$ has compact support in $[-K,K]^d$ and the Fourier transform of a one-unit shallow network $f(x) = \sigma(x^T a)$ has support in the line $\{\xi : \xi = \alpha a, \ \alpha \in \mathbb{R}\}$, it follows that any function in $\mathcal{F}$ is supported in a tube $T = \{\xi : \xi \in \alpha a + [-K,K]^d, \ \alpha \in \mathbb{R}\}$ of radius $K$. For each tube $T$ of radius $K$, we consider $\mathcal{T}_T = \{\varphi \in L^2 : \mathrm{supp}(\varphi) \subseteq T\}$ and
$$\kappa := \sup_{T \text{ tube of radius } K} \|P_{\mathcal{T}_T}(F)\|_2,$$
where $P_{\mathcal{T}_T}(F) = \arg\min_{h \in \mathcal{T}_T} \|h - F\|_2^2$. We claim that
$$\inf_{f \in \mathcal{F}_N} \|f_{r_d,w,v} - f\|_\phi^2 \ge 1 - N\kappa^2. \qquad (12)$$
Indeed, given $f \in \mathcal{F}_N$, denote by $T_1, \dots, T_N$ the associated $N$ tubes, and by $\mathcal{T}_{T_1,\dots,T_N} = \bigoplus_{k \in [N]} \mathcal{T}_{T_k}$ the corresponding subspace spanned by the $\mathcal{T}_{T_k}$, $k \in [N]$. Then, using the isometry of the Fourier transform, we have that
$$\inf_{f \in \mathcal{F}_N} \|f - f_{r_d,w,v}\|_\phi^2 = \inf_{f \in \mathcal{F}_N} \|\widehat{f\phi} - F\|_2^2 \ge \inf_{T_1,\dots,T_N} \inf_{h \in \mathcal{T}_{T_1,\dots,T_N}} \|h - F\|_2^2 = \inf_{T_1,\dots,T_N} \|P_{\mathcal{T}_{T_1,\dots,T_N}}(F) - F\|_2^2 = \inf_{T_1,\dots,T_N} \big(\|F\|_2^2 - \|P_{\mathcal{T}_{T_1,\dots,T_N}}(F)\|_2^2\big). \qquad (13)$$
Now, observe that $\sup_{T_1,\dots,T_N} \|P_{\mathcal{T}_{T_1,\dots,T_N}}(F)\|_2^2 \le N \sup_T \|P_{\mathcal{T}_T}(F)\|_2^2$. Equation (13) therefore becomes
$$\inf_{f \in \mathcal{F}_N} \|f - f_{r_d,w,v}\|_\phi^2 \ge \|F\|_2^2 - N \sup_T \|P_{\mathcal{T}_T}(F)\|_2^2,$$
which proves (12) by plugging in the definition of $\kappa$ and recalling that $\|F\|_2^2 = \|f_{r_d,w,v}\|_\phi^2 = 1$ by Parseval. To establish (11), it is therefore sufficient to prove that
$$\kappa^2 \le \big(\|\psi\|_1^2\, 2^{1-2\eta} K\big)^d\, O(d\, \tau_d\, r_d). \qquad (14)$$
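Before continuing, a small two-dimensional numerical illustration of the geometric fact used above, namely that the Fourier transform of a windowed one-unit network $x \mapsto \sigma(a^T x)\phi(x)$ concentrates in a tube around the line $\{\alpha a\}$ (a minimal sketch assuming NumPy; the tanh activation and Gaussian window are illustrative stand-ins, not the objects of the theorem):

```python
import numpy as np

# Sample sigma(a^T x) times a smooth window on a 2-D grid and check that
# the FFT mass concentrates along the direction a.
n, L = 256, 8.0
xs = np.linspace(-L, L, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")
a = np.array([np.cos(0.3), np.sin(0.3)])        # unit direction of the ridge
ridge = np.tanh(3.0 * (a[0] * X + a[1] * Y))     # one-unit "ridge" function
window = np.exp(-(X**2 + Y**2) / (2 * 2.0**2))   # stands in for phi

F = np.fft.fftshift(np.abs(np.fft.fft2(ridge * window)))
freqs = np.fft.fftshift(np.fft.fftfreq(n))
kx, ky = np.meshgrid(freqs, freqs, indexing="ij")
dist = np.abs(-a[1] * kx + a[0] * ky)            # distance to the line through a
in_tube = dist < 0.02                            # a narrow tube around that line
print(f"fraction of spectral mass in the tube: {F[in_tube].sum() / F.sum():.3f}")
# prints a value close to 1
```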
The rest of the proof will be devoted to establishing a sufficiently sharp upper bound on $\|P_{\mathcal{T}_T}(F)\|_2$. Observe that $P_{\mathcal{T}_T}(F)$ is simply obtained by setting to zero all frequencies of $F$ outside $T$. We start by computing an upper bound on $|F(\xi)|$. We claim the following.

Lemma 25 It holds that
$$|F(\xi)| \le 2^{-d}\, \|\psi\|_1^d \sum_{S \subseteq [d]} \prod_{j=1}^d \min\Big(1, \frac{2K}{\pi(|\xi_j - \xi_{S,j}| - K)_+}\Big). \qquad (15)$$

Let $D(\xi) := \sum_S D_S(\xi)$, with $D_S(\xi) := \prod_{j=1}^d \min\big(1, \frac{2K}{\pi(|\xi_j - \xi_{S,j}| - K)_+}\big)$, so that from Lemma 25 we have
$$|F(\xi)| \le 2^{-d}\, \|\psi\|_1^d\, D(\xi). \qquad (16)$$
Recall that $\tau_d = \sup_{S \subseteq [d]} \|v_S\|_\infty$. Given $\xi$ non-zero, we claim the following.

Lemma 26 It holds that
$$D(\xi) \le C_{K,\gamma}\, 2^{d(1-\eta)} \min\Big(1,\ 2K\big(\pi(\|\xi\|_\infty - r_d \tau_d - K)_+\big)^{-1}\Big), \qquad (17)$$
where $C_{K,\gamma}$ is a constant depending only on $K$ and $\gamma$ (see the proof below).

Now, pick an arbitrary non-zero direction $\nu$ with $\|\nu\|_\infty = 1$. Let
$$T = \{\xi : \inf_{\alpha \in \mathbb{R}} \|\xi - \alpha\nu\|_\infty \le K\} \qquad (18)$$
denote the tube of radius $K$ in the direction $\nu$. It holds that
$$\int_T D(\xi)^2\, d\xi = \underbrace{\int_{T \cap \{\|\xi\|_\infty \le 2\tau_d r_d\}} D(\xi)^2\, d\xi}_{t_1} + \underbrace{\int_{T \cap \{\|\xi\|_\infty > 2\tau_d r_d\}} D(\xi)^2\, d\xi}_{t_2}. \qquad (19)$$
In order to control the two terms $t_1$ and $t_2$, we use the following lemma to upper bound the measure of an $\ell^\infty$-cylinder.

Lemma 27 Let $T$ be an $\ell^\infty$-tube of radius $K$ as defined in (18). If $\mu$ denotes the $d$-dimensional Lebesgue measure, then
$$\mu\big(T \cap [-R,R]^d\big) \le 8e^2 (d-1)(K+R)(2K)^{d-1}. \qquad (20)$$
Moreover, if $g : \mathbb{R} \to \mathbb{R}$ is in $L^1(\mathbb{R})$ and non-increasing, then
$$\int_{T \cap \{\|\xi\|_\infty > R\}} g(\|\xi\|_\infty)\, d\xi \le 4e^2 (d-1)(2K)^{d-1} \int_{R - K(2+3/(d-1))}^{\infty} g(u)\, du, \qquad (21)$$
as long as $R > K(2 + 3/(d-1))$.

From (17) and (20), the first term of (19) can be bounded as
$$t_1 \le 8e^2 C_{K,\gamma}^2\, 2^{2d(1-\eta)}\, 2^{d-1} K^{d-1} (d-1)(K + 2\tau_d r_d) \le D^{(1)}_{K,\gamma}\, d\, (\tau_d r_d)\, 2^{(2(1-\eta)+1)d} K^d \qquad (22)$$
for $D^{(1)}_{K,\gamma} = 16 e^2 K^{-1} C_{K,\gamma}^2$ and $d$ large enough that $2\tau_d r_d \ge K$. Similarly, using (21), the second term $t_2$ can be bounded as
$$t_2 \le 8e^2 \pi^{-2} K\, C_{K,\gamma}^2\, d\, 2^{(2(1-\eta)+1)d} K^d \int_{2\tau_d r_d - K(2+3/(d-1))}^{\infty} (u - \tau_d r_d - K)^{-2}\, du = 8e^2 \pi^{-2} K\, C_{K,\gamma}^2\, d\, 2^{(2(1-\eta)+1)d} K^d\, \big(\tau_d r_d - 3K(1 + 1/(d-1))\big)^{-1} \le D^{(2)}_{K,\gamma}\, d\, 2^{(2(1-\eta)+1)d} K^d, \qquad (23)$$
for $D^{(2)}_{K,\gamma} = 16 e^2 \pi^{-2} C_{K,\gamma}^2$ and $d$ large enough that $\tau_d r_d \ge 10K$. Thus, collecting (22) and (23) and using (16), we obtain
$$\int_T |F(\xi)|^2\, d\xi \le \|\psi\|_1^{2d}\, 2^{-2d} (t_1 + t_2) \le d\, \big(\|\psi\|_1^2\, 2^{1-2\eta} K\big)^d \big(D^{(1)}_{K,\gamma}\, \tau_d r_d + D^{(2)}_{K,\gamma}\big) \le D_{K,\gamma}\, d\, \big(\|\psi\|_1^2\, 2^{1-2\eta} K\big)^d \max(1, \tau_d r_d),$$
where
$$D_{K,\gamma} := D^{(1)}_{K,\gamma} + D^{(2)}_{K,\gamma} = 16 e^2 C_{K,\gamma}^2 \big(K^{-1} + \pi^{-2}\big).$$
It follows that
$$\|P_{\mathcal{T}_T}(F)\|_2^2 = \int_T |F(\xi)|^2\, d\xi \le D_{K,\gamma}\, (d\, \tau_d\, r_d)\, \big(\|\psi\|_1^2\, 2^{1-2\eta} K\big)^d,$$
as long as $d \ge \big(\beta^{-1} \max(1, 10K)\big)^{1/k}$ (where $\beta$ and $k$ satisfy $\tau_d r_d \ge \beta d^k$). We have just established (14), and this concludes the proof of the theorem. In the remaining part of this section we prove the auxiliary lemmas used above.

Proof [of Lemma 25] We start by computing $\hat f_{r_d,w,v}$. From the definition of $\sigma_r$, it follows that $\hat\sigma_{r,S}(\xi) = \delta(\xi - r v_S)$, which combined with the definition of $H_S$ yields
$$\hat f_{r_d,w,v}(\xi) = \sum_S \big(\hat H_S * \hat\sigma_{r_d,S}\big)(\xi) = \sum_S \hat H_S(\xi - r_d v_S).$$
Let $\xi_S := r_d v_S$. It holds that
$$F(\xi) = \int_{\mathbb{R}^d} \hat f_{r_d,w,v}(\nu)\, \hat\phi(\xi - \nu)\, d\nu = \sum_S \int_{\mathbb{R}^d} \hat H_S(\nu - \xi_S)\, \hat\phi(\xi - \nu)\, d\nu = \sum_S \underbrace{\int_{\mathbb{R}^d} \hat H_S(\nu)\, \hat\phi(\xi - \xi_S - \nu)\, d\nu}_{=: F_S(\xi - \xi_S)}. \qquad (24)$$
We can now bound each term $F_S$ separately. It holds that
$$F_S(\xi) = \int \hat H_S(\nu)\, \hat\phi(\xi - \nu)\, d\nu = \int H_S(x)\, e^{2i\pi \xi^T x}\, \phi(x)\, dx = \prod_{j=1}^d F_j(\xi_j), \qquad (25)$$
where
$$F_j(t) = \int_{\mathbb{R}} 1\{\epsilon_j x > 0\}\, e^{2i\pi t x}\, \psi(x)\, dx, \qquad (26)$$
with $\epsilon_j = \pm 1$. Assume without loss of generality that $\epsilon_j = 1$. Observe that $F_j = \check Q$, where $Q(u) = 1\{u > 0\}\, \psi(u)$. Since $\psi \in L^1(\mathbb{R})$ and its Fourier transform $\hat\psi$ has compact support in $[-K,K]$, it holds that
$$|\hat\psi(\tau)| \le \|\psi\|_1 \ \text{ for } \tau \in [-K,K] \quad \text{and} \quad \hat\psi(\tau) = 0 \ \text{ for } |\tau| > K. \qquad (27)$$
On the one hand, since $\psi$ is even, it holds, by directly bounding (26), that
$$|F_j(t)| \le \int_0^{\infty} |\psi(u)|\, du = \frac{1}{2}\|\psi\|_1 \quad \text{for all } t,$$
and from (27) and the Hilbert transform of $Q$ we deduce on the other hand that
$$|F_j(t)| = \Big|\frac{1}{2\pi i}\int \frac{\hat\psi(\tau)}{t - \tau}\, d\tau\Big| \le \frac{2K\, \|\psi\|_1}{2\pi\, (|t| - K)} \quad \text{for } |t| > K,$$
so that it follows that
$$|F_j(t)| \le \frac{\|\psi\|_1}{2}\, \min\Big(1, \frac{2K}{\pi(|t| - K)_+}\Big). \qquad (28)$$
Thus, from equations (24), (25) and (28) it follows that
$$|F(\xi)| \le \sum_{S \subseteq [d]} |F_S(\xi - \xi_S)| \le 2^{-d}\, \|\psi\|_1^d \sum_{S \subseteq [d]} \prod_{j=1}^d \min\Big(1, \frac{2K}{\pi(|\xi_j - \xi_{S,j}| - K)_+}\Big),$$
which proves Lemma 25.

Proof [of Lemma 26] Define, for any $\xi \in \mathbb{R}^d$ and $\lambda > 0$,
$$n(\xi, \lambda) := \big|\{j \in [d] : |\xi_j| > \lambda\}\big|.$$
Recall that $v_S = v + I_S w$ and $\xi_S = r_d v_S$. Observe that $\xi_S - \xi_{S'} = r_d (I_S - I_{S'}) w$, so
$$|\xi_{S,j} - \xi_{S',j}| = \begin{cases} r_d |w_j| & \text{if } j \in (S \cup S') \setminus (S \cap S') \\ 0 & \text{otherwise}. \end{cases} \qquad (29)$$
If $d(S, S')$ denotes the Hamming distance between two subsets $S, S'$, then for all $S, S'$, the following holds.

Lemma 28 It holds that
$$n\big(\xi_S - \xi_{S'}, \gamma d^2\big) = d\big(S \cap \Omega_d, S' \cap \Omega_d\big). \qquad (30)$$

This immediately implies that
$$n\big(\xi - \xi_S, \gamma d^2/2\big) + n\big(\xi - \xi_{S'}, \gamma d^2/2\big) \ge d\big(S \cap \Omega_d, S' \cap \Omega_d\big) \quad \text{for all } \xi \text{ and } S \ne S'. \qquad (31)$$
Indeed, if that were not the case, applying the triangle inequality coordinate-wise would contradict equation (30). The first upper bound of (17) is obtained by first noticing that, for $d > 2\sqrt{K/\gamma}$, it holds
$$D_S(\xi) \le \Big(\frac{\pi(\gamma d^2/2 - K)}{2K}\Big)^{-n(\xi - \xi_S,\, \gamma d^2/2)} \quad \text{for all } S \text{ and } \xi.$$
Now, defining $S_\xi = \arg\min_{S \subseteq [d]} n(\xi - \xi_S, \gamma d^2/2)$, from (31) it follows that
$$n(\xi - \xi_S, \gamma d^2/2) \ge \frac{d(S \cap \Omega_d, S_\xi \cap \Omega_d)}{2} \quad \text{for all } S \ne S_\xi,$$
and thus, for $d > 2\sqrt{K/\gamma}$, it holds
$$D(\xi) = D_{S_\xi}(\xi) + \sum_{S \ne S_\xi} D_S(\xi) \le D_{S_\xi}(\xi) + \sum_{s \ge 1}\ \sum_{S : d(S \cap \Omega_d,\, S_\xi \cap \Omega_d) = s} \Big(\frac{\pi(\gamma d^2/2 - K)}{2K}\Big)^{-s/2} \le 1 + 2^{d - |\Omega_d|} \sum_{s \ge 1} \binom{|\Omega_d|}{s} \Big(\frac{\pi(\gamma d^2/2 - K)}{2K}\Big)^{-s/2} \le C_{K,\gamma}\, 2^{d(1-\eta)}, \qquad (32)$$
since $|\{S : d(S \cap \Omega_d, S_\xi \cap \Omega_d) = s\}| \le 2^{d - |\Omega_d|} \binom{|\Omega_d|}{s}$. The term $C_{K,\gamma}$ is a constant that depends only on $K$ and $\gamma$; in particular, we can choose $C_{K,\gamma} = 2\exp\big(2\sqrt{2K/(\pi\gamma)}\big)$. The second upper bound is obtained using the above argument as follows. Let $q_\xi = \arg\max_j |\xi_j|$. Since $\|\xi_S\|_\infty \le r_d \tau_d$ for any $S \subseteq [d]$, it holds that
$$D_S(\xi) \le \min\Big(1, \frac{2K}{\pi(|\xi_{q_\xi} - \xi_{S,q_\xi}| - K)_+}\Big) \prod_{j \ne q_\xi} \min\Big(1, \frac{2K}{\pi(|\xi_j - \xi_{S,j}| - K)_+}\Big) \le 2K\big(\pi(\|\xi\|_\infty - \tau_d r_d - K)_+\big)^{-1} \prod_{j \ne q_\xi} \min\Big(1, \frac{2K}{\pi(|\xi_j - \xi_{S,j}| - K)_+}\Big),$$
and, summing over $S$,
$$D(\xi) \le C_{K,\gamma}\, 2K\big(\pi(\|\xi\|_\infty - \tau_d r_d - K)_+\big)^{-1}\, 2^{d(1-\eta)}, \qquad (33)$$
by noticing that the argument leading to (32) can be repeated for the $(d-1)$-dimensional vector $\check\xi = (\xi_1, \dots, \xi_{q_\xi - 1}, \xi_{q_\xi + 1}, \dots, \xi_d)$, so that
$$n\big(\check\xi - \check\xi_S, \gamma d^2/2\big) \ge \frac{d\big((S \cap \Omega_d) \setminus \{q_\xi\},\, (S_\xi \cap \Omega_d) \setminus \{q_\xi\}\big)}{2} \quad \text{for all } S \ne S_\xi,$$
which proves (33) and concludes the proof of Lemma 26.

Proof [of Lemma 28] In fact, it holds that the two sets
$$A_1 := \{j \in [d] : |\xi_{S,j} - \xi_{S',j}| > \gamma d^2\} \quad \text{and} \quad A_2 := (S \cap \Omega_d)\, \triangle\, (S' \cap \Omega_d)$$
(the symmetric difference) are equal. Let $j \in A_1$. Then $|\xi_{S,j} - \xi_{S',j}| > \gamma d^2$. Since this quantity is nonzero, equation (29) indicates that, without loss of generality, $j \in S \setminus S'$. Moreover, $|\xi_{S,j} - \xi_{S',j}| = r_d |w_j|$, which implies that $r_d |w_j| > \gamma d^2$ and hence $j \in \Omega_d$. We conclude that $j \in (S \cap \Omega_d) \setminus (S' \cap \Omega_d)$, which implies $j \in A_2$. Now, let $j \in A_2$. Then, without loss of generality, $j \in (S \cap \Omega_d) \setminus (S' \cap \Omega_d)$. Then it holds that $r_d |w_j| > \gamma d^2$, since $j \in \Omega_d$, and $|\xi_{S,j} - \xi_{S',j}| = r_d |w_j|$ by (29), since $j \in S \setminus S'$. Combining these two facts, it follows that $|\xi_{S,j} - \xi_{S',j}| > \gamma d^2$, which means that $j \in A_1$.

Proof [of Lemma 27] Let
$$T_R(\nu) = T(\nu) \cap [-R,R]^d = \Big\{\xi : \inf_{\alpha \in \mathbb{R}} \sup_{j \in [d]} |\xi_j - \alpha \nu_j| \le K \ \text{ and } \ \|\xi\|_\infty \le R\Big\}.$$
The aim is to upper bound the volume of $T_R(\nu)$ for any $\nu$. Assume, without loss of generality, that $\|\nu\|_\infty = 1$. The cut-off tube $T_R(\nu)$ can be covered with $\ell^\infty$-balls of radius $K' = \vartheta K$ centered along the ray defined by $\nu$, that is,
$$T_R(\nu) \subseteq \bigcup_j \big(js\nu + [-\vartheta K, \vartheta K]^d\big). \qquad (34)$$
Now, we optimize both the sampling rate $s \in (0, K)$ and the radius ratio $\vartheta \ge 1$ while satisfying (34). Given $s$, let us first compute the smallest admissible $\vartheta$. Any $x \in T_R(\nu)$ satisfies $\|x - (j + y)s\nu\|_\infty \le K$ for some $j \in \mathbb{N}$ and $|y| < 1$.
This implies that $\|x - js\nu\|_\infty \le K + |y| s \le K + s$. Therefore an admissible $\vartheta$ is given by the solution of $K + s = \vartheta K$, that is, $\vartheta = 1 + sK^{-1}$. Now, the volume of $\bigcup_j \big(js\nu + [-(K+s), K+s]^d\big)$ is upper bounded by
$$l(s) := \frac{4(K+R)}{s}\, \big(2(K+s)\big)^d.$$
Minimizing over $s$ gives $s = K/(d-1)$. Therefore, for all $\nu$, it holds
$$\mu\big(T_R(\nu)\big) \le 4(K+R)(d-1)\, 2^d\, K^{d-1} \Big(1 + \frac{1}{d-1}\Big)^d \le 8e^2 (d-1)(K+R)(2K)^{d-1},$$
which proves (20). Equation (21) is established analogously. Let $T_{>R}(\nu) = T(\nu) \cap \{\xi : \|\xi\|_\infty > R\}$. Then we have that
$$T_{>R}(\nu) \subseteq \bigcup_j \big(js\nu + [-(K+s), K+s]^d\big),$$
where we set $s = K/(d-1)$. Since $g$ is non-increasing, it follows that
$$\int_{T_{>R}(\nu)} g(\|\xi\|_\infty)\, d\xi \le \sum_j \int_{\|\xi - js\nu\|_\infty \le K+s} g(\|\xi\|_\infty)\, d\xi \le \big(2(K+s)\big)^d \sum_j g\big(js - (K+s)\big) \le \frac{\big(2(K+s)\big)^d}{s} \int_{R - K - 2s - (K+s)}^{\infty} g(u)\, du \le 4e^2 (d-1)(2K)^{d-1} \int_{R - K(2 + 3/(d-1))}^{\infty} g(u)\, du.$$
This establishes (21) and concludes the proof.

A.2 Proof of Theorem 10

The proof consists in approximating the activation $\sigma_r$ using Assumption 1.2 on $\sigma$. Since $\sigma_r$ is $(2\pi r)$-Lipschitz, we obtain that, for any $r, Q > 0$ and $N \in \mathbb{N}$, there exist $\alpha_k, \beta_k \in \mathbb{R}$ such that
$$\Big|\sigma_r(t) - \sum_{k=1}^N \alpha_k \sigma(t - \beta_k)\Big| \le \frac{2Qr}{N} \ \text{ for } t \in [-Q, Q] \quad \text{and} \quad \Big|\sum_{k=1}^N \alpha_k \sigma(t - \beta_k)\Big| \le 1 + \frac{2Qr}{N} \ \text{ for } t \in \mathbb{R}.$$
Let $f_N \in \mathcal{F}^\sigma_N$ be defined as
$$f_N(x) = \sum_{k=1}^N \alpha_k\, \sigma\big(r_d (v_d^T x + w_d^T x_+) - \beta_k\big).$$
Now, let $\gamma_d = \|v_d\|_1 + \|w_d\|_1$ and $Q_d = \tilde Q_d\, \gamma_d$, so that by definition when $\|x\|_\infty \le \tilde Q_d$ it holds that $|v_d^T x + w_d^T x_+| \le Q_d$. The approximation error can be decomposed as follows:
$$\int_{\mathbb{R}^d} \big(f_{r_d,w_d,v_d}(x) - f_N(x)\big)^2 \phi(x)^2\, dx = \int_{\|x\|_\infty \le \tilde Q_d} \big(f_{r_d,w_d,v_d}(x) - f_N(x)\big)^2 \phi(x)^2\, dx + \int_{\|x\|_\infty > \tilde Q_d} \big(f_{r_d,w_d,v_d}(x) - f_N(x)\big)^2 \phi(x)^2\, dx$$
$$\le \frac{4 Q_d^2 r_d^2}{N^2}\, \big\|\phi\, 1_{B^d_{\tilde Q_d,\infty}}\big\|_2^2 + 4\Big(1 + \frac{Q_d r_d}{N}\Big)^2 \big(\|\phi\|_2^2 - \|\phi\, 1_{B^d_{\tilde Q_d,\infty}}\|_2^2\big) \le \frac{4 \tilde Q_d^2 \gamma_d^2 r_d^2}{N^2}\, \|\phi\|_2^2 + 4\Big(1 + \frac{Q_d r_d}{N}\Big)^2 \Big(1 - \big(1 - \alpha \tilde Q_d^{-1}\big)^d\Big) \le \frac{4 \tilde Q_d^2 \gamma_d^2 r_d^2}{N^2} + \frac{16 \alpha d}{\tilde Q_d},$$
since $|\psi(x)|^2 \le \alpha |x|^{-2}/2$ for some $\alpha > 0$, as long as $\tilde Q_d > \alpha$ and $N > r_d Q_d$. Optimizing this upper bound with respect to $\tilde Q_d$ gives $\tilde Q_d = \big(2\alpha d N^2 / (\gamma_d^2 r_d^2)\big)^{1/3}$, which results in
$$\|f_{r,w,v} - f_N\|_\phi^2 \lesssim \Big(\frac{d\, \gamma_d\, r_d}{N}\Big)^{2/3},$$
as long as $N > \alpha\, r_d\, \gamma_d$. This concludes the proof.

Appendix B. Proofs of poly(d) upper bounds

B.1 Proof of Lemma 3

We show this for the case $L(f^{(d)}) = 2$, but the proof is analogous for the other cases. The function $f^{(d)}$ has the form
$$f^{(d)}(x) = \gamma_d^T\, \rho_2\big(W_d\, \rho_1(U_d x)\big),$$
where $\rho_1, \rho_2$ are component-wise activations satisfying Assumption 1, and $\gamma_d \in \mathbb{R}^{q_d}$, $W \in \mathbb{R}^{q_d \times p_d}$, $U \in \mathbb{R}^{p_d \times d}$, with
$$p_d,\ q_d,\ \|\gamma\|_\infty,\ \|W\|_{F,\infty},\ \|U\|_{F,\infty} \le \mathrm{poly}(d).$$
Thanks to Assumption 1.2, there exist $A \in \mathbb{R}^{N p_d \times d}$, $B \in \mathbb{R}^{p_d \times N p_d}$, $c \in \mathbb{R}^{N p_d}$ such that
$$\big|\gamma^T \rho_2(W \rho_1(U x)) - \gamma^T \rho_2\big(W B\, \sigma(A x + c)\big)\big| \le \epsilon \quad \text{and} \quad N,\ \|c\|_\infty,\ \|B\|_{F,\infty},\ \|A\|_{F,\infty} \le \epsilon^{-1}\, \mathrm{poly}(d).$$
Let $K_1 = \{B \sigma(A x + c) : x \in K\}$; it holds that $\mathrm{diam}(K_1) \le \mathrm{poly}(d)$. Similarly as before, we get that there exist $D \in \mathbb{R}^{M q_d \times p_d}$, $E \in \mathbb{R}^{q_d \times M q_d}$, $f \in \mathbb{R}^{M q_d}$ such that
$$\big|\gamma^T \rho_2(W y) - \gamma^T E\, \sigma(D y + f)\big| \le \epsilon \quad \text{and} \quad M,\ \|f\|_\infty,\ \|E\|_{F,\infty},\ \|D\|_{F,\infty} \le \epsilon^{-1}\, \mathrm{poly}(d).$$
By setting $\tilde\gamma = E^T \gamma$, $\tilde W = D W B$ and $\tilde U = U A$, we get that $g_\sigma(x) := \tilde\gamma^T \sigma\big(\tilde W \sigma(\tilde U x + c) + f\big)$ satisfies the statement of the theorem.

B.2 Preliminary lemmas

The first lemma is a known result in approximation theory.

Lemma 29 (Jackson's Theorem, Theorem 1.4 in (Rivlin, 1981)) Let $f : [a,b] \to \mathbb{R}$ with modulus of continuity $\omega$. Then there exists a polynomial $p_n(t) = \sum_{k=0}^n p_k t^k$, $p_k \in \mathbb{R}$, such that
$$\sup_{t \in [a,b]} |f(t) - p_n(t)| \le 6\, \omega\Big(\frac{b-a}{2n}\Big).$$

The next lemma yields a worse approximation rate but allows us to control the coefficients of the polynomial. It is a small modification of Lemma 4 in (Safran et al., 2019).

Lemma 30 Let $f : [-r, r] \to \mathbb{R}$ be $(1, \alpha)$-Hölder. Then for any $\epsilon > 0$ there exists a polynomial $p_n(t) = \sum_{k=0}^n r_k t^k$, $r_k \in \mathbb{R}$, of degree $n = \big\lceil 4^{1/\alpha}\, r^\alpha\, \epsilon^{-(1 + 2/\alpha)} \big\rceil$ such that
$$\sup_{t \in [-r,r]} |f(t) - p_n(t)| \le \epsilon.$$
Moreover, $p_n$ can be chosen such that $|r_k| \le 2^n r^{\alpha - k}$, $k \in [n]$, and $|r_0| \le r^\alpha + |f(0)|$.

Proof Notice that we can assume $f(0) = 0$ without loss of generality.
Define $g(t) = f(r(2t - 1))$ for $t \in [0,1]$ and notice that $g$ is $((2r)^\alpha, \alpha)$-Hölder. Also, define the Bernstein polynomials $b_{n,i}$, $i \in \{0, \dots, n\}$, as
$$b_{n,i}(t) = \binom{n}{i} t^i (1 - t)^{n-i} \quad \text{for } t \in [0,1].$$
Notice that they form a partition of unity. We define
$$g_n(t) = \sum_{i=0}^n g\Big(\frac{i}{n}\Big)\, b_{n,i}(t).$$
We have that
$$|g_n(t) - g(t)| \le \sum_{i=0}^n b_{n,i}(t)\, \Big|g(t) - g\Big(\frac{i}{n}\Big)\Big| = \sum_{|i/n - t| < \epsilon} b_{n,i}(t)\, \Big|g(t) - g\Big(\frac{i}{n}\Big)\Big| + \sum_{|i/n - t| \ge \epsilon} b_{n,i}(t)\, \Big|g(t) - g\Big(\frac{i}{n}\Big)\Big| \le (2r)^\alpha \epsilon^\alpha + \frac{r^\alpha}{2n\epsilon^2}.$$
In particular, $\frac{r^\alpha}{2n\epsilon^2} \le \epsilon^\alpha$ if $n \ge \frac{r^\alpha}{2}\, \epsilon^{-(2+\alpha)}$. If we define $p_n(t) = g_n\big(\frac{t + r}{2r}\big)$, then, after rescaling $\epsilon$, we have that
$$\sup_{t \in [-r,r]} |f(t) - p_n(t)| \le \epsilon.$$
Finally, we want to upper bound the coefficients of $p_n$. Notice that we have
$$p_n(t) = (2r)^{-n} \sum_{i=0}^n g\Big(\frac{i}{n}\Big) \binom{n}{i} (t + r)^i (r - t)^{n-i}.$$
It follows that the coefficients of $p_n$ can be bounded by those of $r^{\alpha - n}(t + r)^n$. Let $\tilde r_k$ be the $k$-th coefficient of $r^{\alpha - n}(t + r)^n$. Then
$$\tilde r_k = r^{\alpha - n} \binom{n}{k} r^{n - k} \le 2^n r^{\alpha - k}.$$
This concludes the proof.

B.3 Approximation by shallow Fourier neural networks

We start by recalling a known result ((Burkill, 1959), Theorem 18).

Lemma 31 Let $g : [-\pi, \pi] \to \mathbb{R}$ be $2\pi$-periodic with modulus of continuity $\omega$. Then there exists a trigonometric polynomial $q_n(t) = \sum_{k=-n}^n b_k e^{ikt}$, $b_k \in \mathbb{C}$, with real values (i.e. $q_n(t) \in \mathbb{R}$ for all $t \in [-\pi, \pi]$), such that
$$\sup_{t \in [-\pi,\pi]} |g(t) - q_n(t)| \le 2\, \omega\Big(\frac{2}{n}\Big) \big(2 + \omega(\pi) \log n\big).$$
Moreover, it holds that
$$\sup_k |b_k| \le \frac{1}{2\pi} \int_{-\pi}^{\pi} |g(t)|\, dt.$$

Proof The polynomial $q_n$ is given by the Fejér sum of the Fourier series of $g$, that is,
$$q_n(t) = \frac{1}{n} \sum_{j=0}^{n-1} \sum_{k=-j}^{j} \hat g_k e^{ikt} = \sum_{k=-n}^{n} \Big(1 - \frac{|k|}{n}\Big)_+ \hat g_k e^{ikt}, \quad \text{where } \hat g_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} g(t) e^{-ikt}\, dt.$$
The proof of the upper bound can be found in (Burkill, 1959), Theorem 18. Finally, notice that $q_n$ is real-valued since
$$\hat g_k e^{ikt} + \hat g_{-k} e^{-ikt} = 2\, \mathrm{Re}\big(\hat g_k e^{ikt}\big),$$
because $\hat g_{-k} = \overline{\hat g_k}$, since $g$ takes values in $\mathbb{R}$.

The above result immediately implies a convergence rate for univariate approximation by shallow Fourier networks (that is, with activation $\sigma_1(t) = e^{2\pi i t}$).

Lemma 32 Let $f : [-r, r] \to \mathbb{R}$ be $L$-Lipschitz. Then there exists a real-valued Fourier shallow network $q_n(t) = \sum_{k=-n}^n b_k e^{i w_k t}$, $b_k \in \mathbb{C}$, $w_k \in \mathbb{R}$, such that
$$\sup_{x \in [-r,r]} |f(x) - q_n(x)| \le 3\big(1 + 2L^2 r^2\big) \frac{\log n}{n}$$
for any $n \ge 2$. Moreover, $q_n$ can be chosen such that $|w_k| \le \frac{\pi |k|}{r}$ and $|b_k| \le \|f\|_\infty$ for any $k \in [-n, n]$.

Proof Assume, w.l.o.g., that $f(r) \ge f(-r)$ (otherwise we can consider $f(-x)$ in place of $f(x)$). First, we want to transform $f$ into a $2\pi$-periodic function on $[-\pi, \pi]$. To do this we consider $\bar g$ defined as
$$\bar g(x) = \begin{cases} -L(x + r) + f(-r) & \text{if } x \in \big[-r - \frac{c}{2L},\, -r\big] \\ f(x) & \text{if } x \in [-r, r] \\ -L(x - r) + f(r) & \text{if } x \in \big[r,\, r + \frac{c}{2L}\big], \end{cases}$$
where $c = f(r) - f(-r)$. Notice that $\bar g$ is $L$-Lipschitz and extends to a $2\big(r + \frac{c}{2L}\big)$-periodic function. Finally, let $g : [-\pi, \pi] \to \mathbb{R}$ be defined as
$$g(x) = \bar g\Big(\frac{2Lr + c}{2\pi L}\, x\Big).$$
We have that $g$ is $2\pi$-periodic and $\ell$-Lipschitz for $\ell = \frac{2Lr + c}{2\pi} \le \frac{2Lr}{\pi}$. Therefore, we can apply Lemma 31 to $g$. This gives us a (real-valued) trigonometric polynomial $r_n(t) = \sum_{k=-n}^n b_k e^{ikt}$ such that
$$\sup_{x \in [-\pi,\pi]} |g(x) - r_n(x)| \le \frac{4\ell}{n} \big(2 + \ell \pi \log n\big) \le 3\big(1 + 2L^2 r^2\big) \frac{\log n}{n}$$
for $n \ge 2$. Since
$$\sup_{x \in [-r,r]} \Big|f(x) - r_n\Big(\frac{2\pi L}{2Lr + c}\, x\Big)\Big| \le \sup_{x \in [-r - \frac{c}{2L},\, r + \frac{c}{2L}]} \Big|\bar g(x) - r_n\Big(\frac{2\pi L}{2Lr + c}\, x\Big)\Big| = \sup_{x \in [-\pi,\pi]} |g(x) - r_n(x)|,$$
the thesis follows.

To conclude, we make some remarks about shallow Fourier networks. Note that a generic shallow Fourier network $f_N$ with $N$ units can be represented as
$$f_N(x) = \sum_{k=1}^N u_k e^{i w_k^T x}. \qquad (35)$$
Indeed, we have that
$$\sum_{k=1}^N u_k e^{i(w_k^T x + b_k)} + b = \sum_{k=1}^N \big(u_k e^{i b_k}\big) e^{i w_k^T x} + b\, e^{i 0^T x}$$
for any $b, b_k \in \mathbb{C}$. Let $\mathcal{F}^{\mathrm f}_N$ be the space of networks as in equation (35). Notice that a universal approximation theorem holds for shallow Fourier networks as well. This is because the universal approximation theorem holds for shallow networks with activation $\sigma(t) = \cos(t)$, and since $\cos(t) = (e^{it} + e^{-it})/2$, the thesis follows. Finally, the following lemma will be used in the proof of Theorem 11.

Lemma 33 If $f$ is a (real-valued) shallow Fourier neural network, then so is $f^k$, for $k$ a non-negative integer.
Moreover, if $f$ has $n$ units, then the number of units of $f^k$ is upper bounded by $\binom{n + k - 1}{k}$.

Proof Let $f(x) = \sum_{j=1}^n u_j e^{i w_j^T x}$ be a shallow Fourier neural network. Then, by the multinomial formula, we have that
$$f(x)^k = \Big(\sum_{j=1}^n u_j e^{i w_j^T x}\Big)^k = \sum_{p_1 + \cdots + p_n = k} \binom{k}{p_1, \dots, p_n} \prod_{j=1}^n u_j^{p_j}\, e^{i \big(\sum_{j=1}^n p_j w_j\big)^T x}.$$
Clearly, if $f$ is real-valued, so is $f^k$. Finally, notice that by the formula above, the number of units of $f^k$ is upper bounded by $|\{(p_1, \dots, p_n) : p_1 + \cdots + p_n = k\}| = \binom{n + k - 1}{k}$.

B.4 poly(d) upper bounds for two-hidden-layer networks

Consider a two-hidden-layer neural network $f$ defined as
$$f : x \in \mathbb{R}^d \mapsto \gamma^T g\big(W^T h(U^T x)\big) \in \mathbb{C},$$
where $h : \mathbb{R}^p \to \mathbb{R}^p$ and $g : \mathbb{R}^o \to \mathbb{R}^o$ are, respectively, component-wise 1-Lipschitz and $(1, \alpha)$-Hölder activation functions, and $U \in \mathbb{R}^{d \times p}$, $W \in \mathbb{R}^{p \times o}$, $\gamma \in \mathbb{C}^o$. We wish to approximate $f$ with a one-hidden-layer neural network with a given activation $\sigma$ satisfying Assumption 1.2, for some constant $\nu_\sigma > 0$. We start by proving a result for approximation by shallow Fourier networks at a poly(d) rate.

Proposition 34 Let $K \subseteq \mathbb{R}^d$ be a compact set. There exists $f^{\mathrm f}_N \in \mathcal{F}^{\mathrm f}_N$ such that
$$\|f - f^{\mathrm f}_N\|_{K,\infty} \le \epsilon, \qquad f^{\mathrm f}_N(x) = \sum_{\nu=1}^N b_\nu e^{i v_\nu^T x},$$
for $N = (2np + 1)^m$, with $n$ and $m$ given by (37) and (38) below; in particular,
$$n = \Big\lceil 9 \cdot 4^{1/\alpha}\, \|\gamma\|_1^{2/\alpha}\, \|W\|_\infty^2\, \big(1 + 2C^2\big)^2\, \epsilon^{-2/\alpha} \Big\rceil,$$
where we denoted $C = \sup_{x \in K} \|U^T x\|_\infty$ and $M = \sup_{x \in K} \|W^T h(U^T x)\|_\infty$. Moreover, $f^{\mathrm f}_N$ can be chosen such that it holds
$$\sup_{x \in K} |v_\nu^T x| \le \pi m n \quad \text{and} \quad |b_\nu| \le 2\|\gamma\|_1 \big(4np\, H\, \|W\|_{F,\infty}\big)^m, \qquad (36)$$
where $H = \sup_{t \in [-C,C]} \|h(t)\|_\infty$.

Proof Let $q^j_n$ be given by Lemma 32 to approximate $h_j$ over $[-C, C]$, and let
$$q^{(n)}_k(x) = \sum_{j=1}^p w_{k,j}\, q^j_n(u_j^T x) \quad \text{for } k \in [o].$$
We have that
$$\big|q^{(n)}_k(x) - w_k^T h(U^T x)\big| \le \sum_{j=1}^p |w_{k,j}|\, \big|q^j_n(u_j^T x) - h_j(u_j^T x)\big| \le 3\, \|W\|_\infty \big(1 + 2C^2\big) \frac{\log n}{n} =: \|W\|_\infty \big(1 + 2C^2\big)\, \epsilon_n$$
for $x \in K$. It holds that $q^{(n)}_k$ is a real-valued shallow Fourier network with $(2n-1)p$ terms and first-layer weights given by $\frac{\pi k}{C} u_j$ for $k \in [-(n-1), n-1]$, $j \in [p]$. Moreover, it holds that
$$\big|q^{(n)}_k(x)\big| \le \big|q^{(n)}_k(x) - w_k^T h(U^T x)\big| + \big|w_k^T h(U^T x)\big| \le \|W\|_\infty (1 + 2C^2)\, \epsilon_n + M =: L.$$
Let $p^k_m(t) = \sum_{h=0}^m \beta^k_h t^h$ be given by Lemma 30 to approximate $g_k$ over the interval $[-L, L]$, and let $\epsilon_m$ be the relative error. Let then
$$f_{n,m}(x) = \sum_{k=1}^o \gamma_k\, p^k_m\big(q^{(n)}_k(x)\big).$$
It holds that
$$|f(x) - f_{n,m}(x)| \le \sum_{k=1}^o |\gamma_k|\, \big|g_k\big(w_k^T h(U^T x)\big) - g_k\big(q^{(n)}_k(x)\big)\big| + \sum_{k=1}^o |\gamma_k|\, \big|g_k\big(q^{(n)}_k(x)\big) - p^k_m\big(q^{(n)}_k(x)\big)\big| \le \|\gamma\|_1 \sup_{k \in [o]} \big|w_k^T h(U^T x) - q^{(n)}_k(x)\big|^\alpha + \|\gamma\|_1\, \epsilon_m \le \|\gamma\|_1\, \|W\|_\infty^\alpha (1 + 2C^2)^\alpha\, \epsilon_n^\alpha + \|\gamma\|_1\, \epsilon_m.$$
It holds that $\|\gamma\|_1 \|W\|_\infty^\alpha (1 + 2C^2)^\alpha \epsilon_n^\alpha \le \frac{\epsilon}{2}$ as long as
$$n \ge 9 \cdot 4^{1/\alpha}\, \|\gamma\|_1^{2/\alpha}\, \|W\|_\infty^2\, \big(1 + 2C^2\big)^2\, \epsilon^{-2/\alpha}. \qquad (37)$$
Similarly, $\|\gamma\|_1 \epsilon_m \le \frac{\epsilon}{2}$ as long as
$$m \ge 2 \cdot 16^{1/\alpha}\, \Big(\frac{\|\gamma\|_1}{\epsilon}\Big)^{1 + 2/\alpha} \big(\|W\|_\infty (1 + 2C^2)\, \epsilon_n + M\big)^\alpha. \qquad (38)$$
Moreover, by Lemma 30, $p^k_m$ can be chosen such that its coefficients are bounded by
$$|\beta^k_h| \le \max\big\{2^m L^{\alpha - h},\ L^\alpha + |g_k(0)|\big\} \le 2^m (1 + L^\alpha) + |g_k(0)|,$$
and we can assume $g(0) = 0$ without loss of generality, so that $\|\beta\|_{F,\infty} \le 2^m(1 + L^\alpha)$. Therefore
$$\sup_{x \in K} |f(x) - f_{n,m}(x)| \le \epsilon$$
as long as (37) and (38) hold. Finally, notice that, by Lemma 33, $f_{n,m}$ is a shallow Fourier neural network with number of units upper bounded by
$$\binom{(2n-1)p + m}{m} = \prod_{i=1}^m \frac{(2n-1)p + i}{i} \le \big((2n-1)p + 1\big)^m \le (2np + 1)^m.$$
Therefore, it holds that
$$\inf_{f^{\mathrm f}_N \in \mathcal{F}^{\mathrm f}_N} \sup_{x \in K} |f(x) - f^{\mathrm f}_N(x)| \le \epsilon$$
as long as $N \ge (2np + 1)^m$, with $n$ and $m$ given by (37) and (38) respectively. Finally, notice that the first-layer weights of $f_{n,m}$ are of the form
$$v = \sum_{j=1}^p \sum_{k=-(n-1)}^{n-1} s_{k,j}\, \frac{\pi k}{C}\, u_j$$
over non-negative integers $s_{k,j}$ such that $\sum_{j=1}^p \sum_{k=-(n-1)}^{n-1} s_{k,j} \le m$. Therefore, if $f^{\mathrm f}_N(x) = \sum_{\nu=1}^N b_\nu e^{i v_\nu^T x}$, then
$$|v_\nu^T x| \le \frac{m \pi (n-1)}{C}\, \max_{j \in [p]} |u_j^T x| \le m n \pi.$$
On the other hand, the coefficients $b_\nu$ have the form
$$b_\nu = \sum_{k=1}^o \gamma_k\, \beta^k_h \binom{h}{s} \prod_{l,j} \big(w_{k,j}\, (q^j_n)_l\big)^{s_{l,j}}$$
over non-negative integers $s = (s_{l,j})_{l,j}$ such that $\sum_{j=1}^p \sum_{l=-(n-1)}^{n-1} s_{l,j} = h \le m$, where $(q^j_n)_l$ denotes the $l$-th coefficient of $q^j_n$. By Lemma 31, we know that $|(q^j_n)_l| \le \sup_{t \in [-C,C]} |h_j(t)| \le H$. Therefore
$$|b_\nu| \le \big((2n-1)p\big)^h \Big(\sup_{l,j} |w_{k,j}|\, |(q^j_n)_l|\Big)^h \sum_{k=1}^o |\gamma_k|\, |\beta^k_h| \le \big[(2n-1)p\, H\, \|W\|_{F,\infty}\big]^m\, \|\gamma\|_1\, \|\beta\|_{F,\infty}.$$
This concludes the proof.

We can now state a detailed version of Theorem 11.

Theorem 35 Let $K$ be a compact set and
$$C = \sup_{x \in K} \|U^T x\|_\infty, \qquad M = \sup_{x \in K} \|W^T h(U^T x)\|_\infty, \qquad H = \sup_{t \in [-C,C]} \|h(t)\|_\infty.$$
It holds that
$$\inf_{f^\sigma_N \in \mathcal{F}^\sigma_N} \|f - f^\sigma_N\|_{K,\infty} \le \epsilon \quad \text{for} \quad N \ge \nu_\sigma\, \epsilon^{-1}\, \|\gamma\|_1\, m n\, (4np + 1)^{2m} \big(H \|W\|_{F,\infty}\big)^m,$$
where
$$n = \Big\lceil 9 \cdot 4^{1/\alpha} \|\gamma\|_1^{2/\alpha} \|W\|_\infty^2 (1 + 2C^2)^2\, \epsilon^{-2/\alpha} \Big\rceil \quad \text{and} \quad m = \Big\lceil 2 \cdot 16^{1/\alpha} \Big(\frac{\|\gamma\|_1}{\epsilon}\Big)^{1 + 2/\alpha} \big(\|W\|_\infty (1 + 2C^2) + M\big)^\alpha \Big\rceil.$$
Moreover, it is possible to choose $f^\sigma_N$ such that $m_\infty(f^\sigma_N) \le 1 + N^2$.

Proof Let $f^{\mathrm f}_N$ be given by Proposition 34 such that $\sup_{x \in K} |f(x) - f^{\mathrm f}_N(x)| \le \frac{\epsilon}{2}$. We know that
$$f^{\mathrm f}_N(x) = \sum_{k=1}^N b_k e^{i v_k^T x} = f^{\mathrm c}_N(x) + i f^{\mathrm s}_N(x), \quad \text{where} \quad f^{\mathrm c}_N(x) = \sum_{k=1}^N b_k \cos(v_k^T x) \quad \text{and} \quad f^{\mathrm s}_N(x) = \sum_{k=1}^N b_k \sin(v_k^T x),$$
and $|b_k| \le B$ and $|v_k^T x| \le V$ for $x \in K$, where $B$ and $V$ are given by (36). Using the assumption on $\sigma$, we know that, for each $k \in [N]$, there exist shallow networks $f^{\mathrm c}_k$ and $f^{\mathrm s}_k$ with activation $\sigma$, with a number of units controlled through Assumption 1.2 by $V$ and $\frac{4NB}{\epsilon}$, such that
$$\sup_{x \in K} \big|f^{\mathrm c}_k(x) - \cos(v_k^T x)\big| \le \frac{\epsilon}{4NB} \quad \text{and} \quad \sup_{x \in K} \big|f^{\mathrm s}_k(x) - \sin(v_k^T x)\big| \le \frac{\epsilon}{4NB}.$$
Letting $f^\sigma(x) = \sum_{k=1}^N b_k f^{\mathrm c}_k(x) + i \sum_{k=1}^N b_k f^{\mathrm s}_k(x)$, it holds that
$$\sup_{x \in K} |f^\sigma(x) - f^{\mathrm f}_N(x)| \le \sum_{k=1}^N |b_k| \sup_{x \in K} \big|f^{\mathrm c}_k(x) - \cos(v_k^T x)\big| + \sum_{k=1}^N |b_k| \sup_{x \in K} \big|f^{\mathrm s}_k(x) - \sin(v_k^T x)\big| \le N B \frac{\epsilon}{4NB} + N B \frac{\epsilon}{4NB} = \frac{\epsilon}{2},$$
which implies that $\sup_{x \in K} |f^\sigma(x) - f(x)| \le \epsilon$. Moreover, notice that we can assume that all second-layer weights of $f^\sigma$ are real; indeed, if this is not the case, one can replace them by their real parts, and the upper bound above can only decrease. Finally, the number of units of $f^\sigma$ is given by the total number of units of the networks $f^{\mathrm c}_k, f^{\mathrm s}_k$, $k \in [N]$. Applying Proposition 34 concludes the proof.

B.5 Proofs of special cases

B.5.1 Radial functions

Let $f(x) = \phi(\|x\|)$ with $\phi$ 1-Lipschitz. Then it holds that $f(x) = g(\mathbf{1}^T h(x))$, where $g(t) = \phi(\sqrt t)$ and $h : \mathbb{R}^d \to \mathbb{R}^d$ is defined as $h_i(x) = x_i^2$. Clearly, $\sup_{x \in B^d_{1,2}} \|x\|_\infty = 1$, $\sup_{x \in B^d_{1,2}} \mathbf{1}^T h(x) = \sup_{x \in B^d_{1,2}} \|x\|^2 = 1$ and $\sup_{x \in [-1,1]} |h(x)| = \sup_{x \in [-1,1]} |x|^2 = 1$. Moreover, $\|\mathbf{1}\|_1 = d$ and $g$ is $(1, 1/2)$-Hölder. Then, by applying Theorem 35, we get the following.

Corollary 36 (Radial functions) It holds that
$$\inf_{f^\sigma_N \in \mathcal{F}^\sigma_N} \|f^\sigma_N - f\|_{B^d_{1,2},\infty} \le \epsilon \quad \text{for} \quad N \ge \nu_\sigma\, \big(\alpha\, d^2\big)^{(4 + \epsilon^{-1})^2},$$
where $\alpha > 0$ is a numerical constant. In particular, $N \le \mathrm{poly}(d)$ for any fixed error threshold $\epsilon$.

B.5.2 Shallow approximation of piece-wise oscillatory functions

Consider $f_{w,U} : x \in \mathbb{R}^d \mapsto e^{i w^T (U x)_+}$ for some $w \in \mathbb{R}^p$, $U \in \mathbb{R}^{p \times d}$. Then Theorem 35 implies the following.

Corollary 37 (Approximation of (2) by shallow networks) It holds that
$$\inf_{f^\sigma_N \in \mathcal{F}^\sigma_N} \|f_{w,U} - f^\sigma_N\|_{B^d_{r,p},\infty} \le \epsilon \quad \text{for} \quad N \ge \nu_\sigma\, \frac{\beta}{\epsilon^6} \big(2 + \epsilon + 2r \|w\|_1 \|U\|_{p,\infty}\big)^2 \Big(\frac{4 p \beta}{\epsilon^2} + 1\Big)^{\frac{2\alpha}{\epsilon^3}\, (\epsilon + 2r \|w\|_1 \|U\|_{p,\infty})},$$
where $\beta = \alpha\, \|w\|_1^2 \big(1 + 2r^2 \|U\|_{p,\infty}^2\big)^2$ and $\alpha$ is a numerical constant.

B.5.3 Approximation bounds under the Gaussian metric

For sake of simplicity, in this section we consider approximation bounds for the function of interest
$$f_{w,U} : x \in \mathbb{R}^d \mapsto e^{i w^T (U x)_+}$$
for some $w \in \mathbb{R}^p$, $U = [u_1 | \cdots | u_p]^T \in \mathbb{R}^{p \times d}$. Notice that the following results can be naturally extended to any three-layer network target. We are interested in upper bounding the error
$$\inf_{f_N \in \mathcal{F}^{\mathrm f}_N} \Big(\mathbb{E}\, |f_{w,U}(X) - f_N(X)|^2\Big)^{1/2},$$
where the expectation is taken over $X \sim \mathcal{N}(0, \sigma^2 I)$. For simplicity of notation, we denote
$$\|f - g\|_{\sigma,2} := \Big(\mathbb{E}\, |f(X) - g(X)|^2\Big)^{1/2}.$$
It is a well-known fact that Gaussian vectors concentrate in a ball of radius $\sigma \sqrt d$. We recall a quantitative version of this fact in the following lemma.
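Numerically, this concentration is easy to observe (a minimal Python sketch assuming only NumPy, purely illustrative and not part of the formal argument):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_samples = 1000, 1.0, 10_000

# Norms of standard Gaussian vectors in dimension d concentrate around sigma*sqrt(d),
# with fluctuations of order sigma, independently of d.
X = sigma * rng.standard_normal((n_samples, d))
norms = np.linalg.norm(X, axis=1)

print(f"sigma*sqrt(d)        = {sigma * np.sqrt(d):.2f}")
print(f"mean norm            = {norms.mean():.2f}")
print(f"std of norms         = {norms.std():.3f}")
print(f"P(norm > sqrt(d) + 3) ~ {(norms > np.sqrt(d) + 3).mean():.4f}")  # ~ e^{-9/2}
```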
Lemma 38 Let $X \sim \mathcal{N}(0, \sigma^2 I)$ be a $d$-dimensional Gaussian vector. Then it holds that
$$\mathbb{P}\big\{\|X\|_2 \ge \sigma(\sqrt d + t)\big\} \le e^{-t^2/2}.$$

Thanks to Proposition 34, the following holds.

Lemma 39 Let $r > 0$. Then it holds that
$$\inf_{f_N \in \mathcal{F}^{\mathrm f}_N} \|f_N - f_{w,U}\|_{B^d_{r,2},\infty} \le \delta \qquad (39)$$
as long as $N \ge (2np + 1)^m$, where
$$n = \Big\lceil \frac{36}{\delta^2}\, \|w\|_1^2 \big(1 + 2r^2 \|U\|_{2,\infty}^2\big)^2 \Big\rceil \quad \text{and} \quad m = \Big\lceil \frac{16}{\delta^3} \big(\delta + 2r \|w\|_1 \|U\|_{2,\infty}\big) \Big\rceil.$$
Moreover, under the same assumptions, we can also assume that the function $f_N$ that satisfies (39) also satisfies
$$\|f_N\|_\infty \le N \big(2 + \delta + 2r \|w\|_1 \|U\|_{2,\infty}\big) \big(4npr\, \|w\|_1\, \|U\|_{2,\infty}\big)^m.$$

Thanks to these two lemmas, the following proposition follows.

Proposition 40 Let $\sigma = d^{-1/2}$ and assume that $\|U\|_{2,\infty} \le 1$. Then it holds that
$$\inf_{f_N \in \mathcal{F}^{\mathrm f}_N} \|f_N - f_{w,U}\|_{\sigma,2} \le \epsilon \qquad (40)$$
for
$$N \ge \big(1 + \|w\|_1^s\big)\, K^{\big(1 + \big(\frac{\log p}{d}\big)^s\big)\big(1 + \frac{1}{\epsilon^s}\big)\big(1 + \|w\|_1^s\big)},$$
where $K > 0$ and $s \ge 1$ are some numerical constants.

Proof Let $c = \|w\|_1$ and $\omega = \max(1, c)$. First, notice that $\|f_{w,U}\|_\infty = 1$. Let $\chi_r(x) = 1\{\|x\|_2 \le r\}$ and let $f_N$ be given by Lemma 39 for a certain $\delta > 0$. Then it holds that
$$\|f_N - f_{w,U}\|_{\sigma,2} \le \|(f_N - f_{w,U})\chi_r\|_{\sigma,2} + \|(f_N - f_{w,U})(1 - \chi_r)\|_{\sigma,2} \le \|f_N - f_{w,U}\|_{B^d_{r,2},\infty} + \mathbb{P}\big(\|X\|_2 > r\big)^{1/2} \big(\|f_N\|_\infty + \|f_{w,U}\|_\infty\big).$$
Since $\sigma = d^{-1/2}$, Lemma 38 gives $\mathbb{P}(\|X\|_2 > 1 + t) \le e^{-dt^2/2}$ for $t > 0$. Hence, taking $r = 1 + t$, it follows that
$$\|f_N - f_{w,U}\|_{\sigma,2} \le \delta + e^{-\frac{dt^2}{4}} \big(1 + \|f_N\|_\infty\big).$$
Moreover, by Lemma 39 (using $\|U\|_{2,\infty} \le 1$ and $N = (2np+1)^m$), one can assume
$$\|f_N\|_\infty \le (2 + \delta + 2r\omega) \Big(\frac{144\, p\, r}{\delta^2}\, \omega^3 (1 + r^2)^2 + 1\Big)^{\frac{32}{\delta^3}(\delta + 2r\omega)}.$$
Let $\delta = \frac{\epsilon}{2}$. If $t \ge 1$, it holds that
$$\|f_N\|_\infty \le (4\omega + \epsilon + 2\omega t) \Big(\frac{576\, p}{\epsilon^2}\, \omega^3 (1 + t) \big(1 + (1 + t)^2\big)^2 + 1\Big)^{\frac{256}{\epsilon^3}(\epsilon + 2\omega + 2\omega t)} \le K(\epsilon + \omega + \omega t) \Big(\frac{K\, p}{\epsilon^2}\, \omega^2 t^5 + 1\Big)^{\frac{K}{\epsilon^3}(\epsilon + \omega + \omega t)}.$$
In the equation above and in the following, $K$ denotes a (large enough) numerical constant. Therefore, the bound (40) holds provided that
$$\frac{dt^2}{4} \ge \log\Big(1 + K(\epsilon + \omega + \omega t)\Big(\frac{K p}{\epsilon^2}\, \omega^2 t^5 + 1\Big)^{\frac{K}{\epsilon^3}(\epsilon + \omega + \omega t)}\Big) + \log\frac{2}{\epsilon}. \qquad (41)$$
Since $\log(1 + C s^a) \le \log(1 + C) + a \log s$ if $s \ge 1$, $C > 0$ and $a > 0$, the above is implied by
$$\frac{dt^2}{4} \ge \log\big(1 + K(\epsilon + \omega + \omega t)\big) + \frac{K}{\epsilon^3}(\epsilon + \omega + \omega t) \log\Big(\frac{K p}{\epsilon^2}\, \omega^2 t^5 + 1\Big) + \log\frac{2}{\epsilon}.$$
Since
$$\log\big(1 + K(\epsilon + \omega + \omega t)\big) \le K(\epsilon + \omega + \omega t) \quad \text{and} \quad \log\Big(\frac{K p \omega^2}{\epsilon^2}\, t^5 + 1\Big) \le \log\Big(1 + \frac{K p \omega^2}{\epsilon^2}\Big) + 5 \log t,$$
equation (41) holds if
$$\frac{dt^2}{4} - \alpha - \beta t^{1/2} - \gamma t - \eta t^{3/2} \ge 0,$$
where, using $\log t \le \sqrt t$ for $t \ge 1$,
$$\alpha = K(\epsilon + \omega) + \frac{K}{\epsilon^3}(\epsilon + \omega) \log\Big(1 + \frac{K p \omega^2}{\epsilon^2}\Big) + \log\frac{2}{\epsilon} > 0, \qquad \beta = \frac{5K}{\epsilon^3}(\epsilon + \omega) > 0,$$
$$\gamma = K\omega + \frac{K}{\epsilon^3}\, \omega \log\Big(1 + \frac{K p \omega^2}{\epsilon^2}\Big) > 0, \qquad \eta = \frac{5K}{\epsilon^3}\, \omega > 0.$$
It follows that equation (41) holds if
$$t \ge \Big(1 + \frac{4(\alpha + \beta + \gamma + \eta)}{d}\Big)^2.$$
Plugging this choice of $t$, together with $\delta = \epsilon/2$, into the bound on $N$ from Lemma 39, and using $\omega \le 1 + c$, yields the claimed bound. The thesis follows.

B.6 Extension to generic L-layer networks

The results presented in the previous sections can be generalized to the approximation of generic multi-layer neural networks. In this section we present a result analogous to Theorem 11 for this more general case. Consider a multi-layer neural network $f$ defined as
$$f : x \in \mathbb{R}^d \mapsto x^{(L+1)}(x) \in \mathbb{C},$$
where $x^{(L+1)}$ is defined by recursion by $x^{(0)}(x) = x$,
$$x^{(k)}(x) = \sigma^{(k)}\big(A^{(k)} x^{(k-1)}(x)\big) \ \text{ for } k \in [L], \qquad x^{(L+1)}(x) = \big[a^{(L+1)}\big]^T x^{(L)}(x),$$
where $A^{(k)} = [a^{(k)}_1 | \cdots | a^{(k)}_{d_k}]^T \in \mathbb{R}^{d_k \times d_{k-1}}$ for $k \in [L]$ (with $d_0 = d$), $a^{(L+1)} \in \mathbb{C}^{d_L}$, and $\sigma^{(k)} : \mathbb{R}^{d_k} \to \mathbb{R}^{d_k}$ are $\frac{1}{6}$-Lipschitz component-wise activation functions verifying $\sigma^{(k)}(0) = 0$ for $k \in [L]$. In the following we also assume that $\|A^{(k)}\|_\infty \le 1$ for $k \in [L]$ and $\|a^{(L+1)}\|_1 \le 1$. Note that these assumptions can easily be relaxed, but we adopt them here for sake of simplicity.

Proposition 41 Let $f$ be as above. It holds that
$$\inf_{f_N \in \mathcal{F}^{\mathrm f}_N} \|f - f_N\|_{B^d_{1,\infty},\infty} \le \epsilon \quad \text{for} \quad N \ge \Big(2^L\, C\, \big(1 + \epsilon^{-2}\big)\, d_1\Big)^{\big(\frac{2L}{\epsilon} + L\big)^{L-1}},$$
where $C$ is a numerical constant.

Before proving the above proposition, we prove two preliminary lemmas.

Lemma 42 Let $W = \{w_\ell\}_{\ell \in [K]} \subseteq \mathbb{R}^d$ and $h : \mathbb{R}^d \to \mathbb{R}^p$ be such that $h_j$ is a shallow Fourier neural network with first-layer weights given by $W$, for all $j \in [p]$.
Consider $q : \mathbb{R}^p \to \mathbb{R}^m$ of the form $q(x) = B \varsigma(x)$, where $\varsigma : \mathbb{R}^p \to \mathbb{R}^p$ is a component-wise polynomial activation function of degree at most $D$ and $B \in \mathbb{C}^{m \times p}$. Then there exists a finite $V \subseteq \mathbb{R}^d$ such that $f := q \circ h$ is such that $f_j$ is a Fourier neural network with first-layer weights given by $V$ for each $j \in [m]$, and such that $|V| \le (2K)^D$.

Proof The functions $f_j$ have the form
$$f_j(x) = \sum_{k=1}^p B_{jk} \sum_{l=0}^D \alpha_{k,l}\, \big(h_k(x)\big)^l, \qquad h_k(x) = \sum_{\nu=1}^K \beta_{k,\nu}\, e^{i w_\nu^T x}.$$
By Lemma 33, we see that each $f_j$ is a Fourier neural network with the same set of first-layer weights, of size at most $(K + 1)^D \le (2K)^D$. This concludes the proof.

Lemma 43 Consider the same assumptions as in Proposition 41. Then there exists a polynomial network
$$f_{N_1,\dots,N_L} : x \in \mathbb{R}^d \mapsto y^{(L+1)}(x) \in \mathbb{C}$$
given by the recursion $y^{(0)}(x) = x$,
$$y^{(k)}(x) = p^k_{N_k}\big(A^{(k)} y^{(k-1)}(x)\big) \ \text{ for } k \in [L], \qquad y^{(L+1)}(x) = \big[a^{(L+1)}\big]^T y^{(L)}(x),$$
where the $p^k_{N_k}$ are component-wise polynomial activation functions of degree $N_k$, such that
$$\|f - f_{N_1,\dots,N_L}\|_{B^d_{1,\infty},\infty} \le \epsilon \qquad (42)$$
as long as $N_k \ge \frac{L}{\epsilon} + (L - 1)$ for $k \in [L]$. In particular, $f_{N_1,\dots,N_L}$ is a polynomial of degree $\prod_{k=1}^L N_k$.

Proof We show this by induction over $L$. First, consider the case $L = 1$. By Lemma 29, for each $j \in [d_1]$, there exist polynomials $p_{N,j} : \mathbb{R} \to \mathbb{R}$ of degree $N$ which verify
$$\big|p_{N,j}\big((a^{(1)}_j)^T x\big) - \sigma^{(1)}_j\big((a^{(1)}_j)^T x\big)\big| \le \frac{1}{N},$$
since $|(a^{(1)}_j)^T x| \le 1$ by assumption. Since $\|a^{(2)}\|_1 \le 1$, it follows that
$$\big|(a^{(2)})^T p_N\big(A^{(1)} x\big) - (a^{(2)})^T \sigma^{(1)}\big(A^{(1)} x\big)\big| \le \frac{1}{N}.$$
This implies the thesis for the case $L = 1$. Now consider the induction step; that is, assume that, for every $\delta > 0$ and $j \in [d_{L-1}]$, there exists a certain $f^j_{N_1,\dots,N_{L-1}}$ such that
$$\big|x^{(L-1)}_j(x) - f^j_{N_1,\dots,N_{L-1}}(x)\big| \le \delta$$
as long as $N_k \ge \frac{L-1}{\delta} + (L - 2)$ for $k \in [L-1]$. Notice that this implies that
$$\big|(a^{(L)}_j)^T f_{N_1,\dots,N_{L-1}}(x)\big| \le 1 + \delta,$$
where $f_{N_1,\dots,N_{L-1}} = (f^1_{N_1,\dots,N_{L-1}}, \dots, f^{d_{L-1}}_{N_1,\dots,N_{L-1}})$. Therefore, for each $j \in [d_L]$, by Lemma 29, there exist polynomials $p_{N,j}$ of degree $N$ such that
$$\big|p_{N,j}\big((a^{(L)}_j)^T f_{N_1,\dots,N_{L-1}}(x)\big) - \sigma^{(L)}_j\big((a^{(L)}_j)^T f_{N_1,\dots,N_{L-1}}(x)\big)\big| \le \frac{1 + \delta}{N}.$$
Let then $f_{N_1,\dots,N_{L-1},N}$ be defined as
$$f_{N_1,\dots,N_{L-1},N}(x) = \sum_{j=1}^{d_L} a^{(L+1)}_j\, p_{N,j}\big((a^{(L)}_j)^T f_{N_1,\dots,N_{L-1}}(x)\big).$$
Since $\|a^{(L+1)}\|_1 \le 1$, it holds that
$$\big|f(x) - f_{N_1,\dots,N_{L-1},N}(x)\big| \le \big|f_{N_1,\dots,N_{L-1},N}(x) - (a^{(L+1)})^T \sigma^{(L)}\big(A^{(L)} f_{N_1,\dots,N_{L-1}}(x)\big)\big| + \big|(a^{(L+1)})^T \sigma^{(L)}\big(A^{(L)} f_{N_1,\dots,N_{L-1}}(x)\big) - f(x)\big| \le \frac{1 + \delta}{N} + \delta.$$
Choosing $\delta = \frac{(L-1)\epsilon}{L}$ and $N \ge \frac{L}{\epsilon}$, equation (42) holds as long as $N_k \ge \frac{L}{\epsilon} + (L - 1)$ for all $k \in [L]$. This concludes the proof of the lemma.

Proof [of Proposition 41] It holds that $f(x) = g\big(\sigma^{(1)}(A^{(1)} x)\big)$, where $g$ is an $(L-1)$-hidden-layer neural network with input dimension $d_1$. By Lemma 32, for every $j \in [d_1]$, there exist Fourier networks $q_{N_1,j}$ with $2N_1 - 1$ units such that
$$\big|\sigma^{(1)}_j\big((a^{(1)}_j)^T x\big) - q_{N_1,j}\big((a^{(1)}_j)^T x\big)\big| \le \frac{C}{\sqrt{N_1}},$$
where $C > 0$ is a numerical constant. Notice that this implies that, for $N_1 \ge 4C^2$, it holds that $\|q_{N_1}(A^{(1)} x)\|_\infty \le 1$. Now, we can approximate $g$ with a polynomial neural network $g_{N_L,\dots,N_2}$ as given by Lemma 43. In particular, for any $\delta > 0$, there exists $g_{N_L,\dots,N_2}$ such that
$$\sup_{x \in [-1,1]^{d_1}} |g_{N_L,\dots,N_2}(x) - g(x)| \le \delta$$
as long as $N_k \ge \frac{L-1}{\delta} + (L - 2)$ for $k \in \{2, \dots, L\}$. It follows that
$$\big|g_{N_L,\dots,N_2}\big(q_{N_1}(A^{(1)} x)\big) - f(x)\big| \le \delta + \frac{C}{\sqrt{N_1}}.$$
Let $f_N(x) = g_{N_L,\dots,N_2}\big(q_{N_1}(A^{(1)} x)\big)$. By choosing $\delta = \epsilon/2$, it holds that $\sup_{x \in [-1,1]^d} |f_N(x) - f(x)| \le \epsilon$ as long as $N_k \ge \frac{2(L-1)}{\epsilon} + (L - 2)$ for $k \in \{2, \dots, L\}$ and $N_1 \ge C^2\big(1 + \frac{4}{\epsilon^2}\big)$. We claim that $f_N$ is a Fourier network with at most
$$N = \big(2^L\, N_1\, d_1\big)^{\prod_{k=2}^L N_k} \qquad (43)$$
units. We can prove this by induction over $L \ge 2$. Recall that $g_{N_L,\dots,N_2}$ is of the form
$$g_{N_L,\dots,N_2}(x) = \big[a^{(L+1)}\big]^T g^L_{N_L}\Big(A^{(L)} g^{L-1}_{N_{L-1}}\big(A^{(L-1)} \cdots g^2_{N_2}(A^{(2)} x)\big)\Big),$$
where $g^k_{N_k}$ is a component-wise polynomial of degree at most $N_k$, for $k \in \{2, \dots, L\}$. We start with the case $L = 2$.
Notice that each component of $A^{(2)} q_{N_1}(A^{(1)} x)$ is a Fourier network with the same set of first-layer weights, of size at most $(2N_1 - 1) d_1$. Then, by Lemma 42, we have that each component of
$$f^2_{N_2,N_1}(x) := A^{(3)} g^2_{N_2}\big(A^{(2)} q_{N_1}(A^{(1)} x)\big)$$
is a Fourier network with the same set of first-layer weights, of size at most $\big(2(2N_1 - 1) d_1\big)^{N_2}$. Finally, consider the induction step. By the induction hypothesis, the function
$$f^{L-1}_{N_{L-1},\dots,N_1}(x) := A^{(L)} g^{L-1}_{N_{L-1}}\Big(A^{(L-1)} \cdots g^2_{N_2}\big(A^{(2)} q_{N_1}(A^{(1)} x)\big)\Big)$$
is such that each component is a Fourier network with the same set of first-layer weights, of size at most $\big(2^{L-2} (2N_1 - 1) d_1\big)^{\prod_{k=2}^{L-1} N_k}$. Then, by Lemma 42, the function
$$f_N(x) = \big[a^{(L+1)}\big]^T g^L_{N_L}\big(f^{L-1}_{N_{L-1},\dots,N_1}(x)\big)$$
is a Fourier network with at most
$$\Big(2\, \big(2^{L-2} (2N_1 - 1) d_1\big)^{\prod_{k=2}^{L-1} N_k}\Big)^{N_L} \le \big(2^{L-1} (2N_1 - 1) d_1\big)^{\prod_{k=2}^L N_k} \le \big(2^L N_1 d_1\big)^{\prod_{k=2}^L N_k}$$
units, which implies equation (43). Plugging in the lower bounds on $N_k$ in terms of $\epsilon$, the thesis follows.

B.7 Fixed-dimension approximation

The results of Section 4 on fixed-threshold approximation can be complemented by the following result on fixed-dimension approximation. The proposition below is a straightforward generalization of Theorem 3 in (Safran et al., 2019).

Proposition 44 Let $\sigma$ be an activation satisfying Assumption 1. Then there exists a constant $\beta > 0$ such that for any 1-Lipschitz function $f : B^d_{1,2} \to \mathbb{C}$ and any $\epsilon > 0$ there exists a network $f_N \in \mathcal{F}^\sigma_N$ such that
$$\|f - f_N\|_{B^d_{1,\infty},\infty} \le \epsilon$$
for some $N \le 2 + \beta\, d^7\, \big(\beta\, \epsilon^{-1}\big)^d\, \epsilon^{-6}$.

Proof The result is proved by noticing that the proof of Theorem 3 in (Safran et al., 2019) actually holds for any function $f$ as in the statement. Moreover, using Assumption 1, $f_N$ can also be chosen so that an equivalent bound holds for $m_\infty(f_N)$.

Appendix C. Proofs related to spherical harmonics analysis of shallow networks

C.1 Proof of Proposition 16

Let $f_N : \mathbb{R}^d \to \mathbb{R}$ be a one-hidden-layer network defined by
$$f_N(x) = \sum_{i=1}^N u_i\, f_{\sigma_i,w_i}(x) := \sum_{i=1}^N u_i\, \sigma_i\big(w_i^T x\big),$$
where $u \in \mathbb{R}^N$, $w_i \in \mathbb{S}^{d-1}$, and the $\sigma_i$ are linearly bounded activations. Thanks to Parseval's formula, it holds that
$$\|f_N - f^{(d)}\|_2^2 \ge \|P_{I_d} f_N - P_{I_d} f^{(d)}\|_2^2 \ge \|P_{I_d} f^{(d)}\|_2^2 - 2 \sum_{i=1}^N u_i\, \big\langle f_{\sigma_i,w_i},\, P_{I_d} f^{(d)} \big\rangle$$
$$\ge \|P_{I_d} f^{(d)}\|_2^2 - 2 \sum_{i=1}^N \sum_{j \in I_d} |u_i|\, \|f^{(d)}_j\|_2\, \|(f_{\sigma_i,w_i})_j\|_2 \qquad (44)$$
$$\ge \|P_{I_d} f^{(d)}\|_2^2 - 2\, \|f^{(d)}\|_2 \Big(\sup_{j \in I_d} c_{d,j}\Big) \sum_{i=1}^N |u_i|\, \|f_{\sigma_i,w_i}\|_2 \ge \|P_{I_d} f^{(d)}\|_2^2 - 2\, O(d^{-M})\, \epsilon\, d^\alpha\, \|f^{(d)}\|_2 \sum_{i=1}^N |u_i|\, \|f_{\sigma_i,w_i}\|_2.$$
Finally, notice that it holds that $|u_i|\, \|f_{\sigma_i,w_i}\|_2 \le 2\, m_\infty^2(f_N)$, and therefore
$$\|f_N - f^{(d)}\|_2^2 \ge \Omega(d^{-2M}) - 4\, O(d^{-M})\, \epsilon\, d^\alpha\, m_\infty^2(f_N)\, N.$$
This concludes the proof.

C.2 Low-coherence zonal harmonic frames

In this section, we wish to quantify how incoherent a frame composed of zonal harmonics can be. More specifically, we wish to find a lower bound for
$$N(d, k, \epsilon) = \sup\Big\{N \ge 1 : \exists\, w_1, \dots, w_N \in \mathbb{S}^{d-1} \ \text{s.t.} \ \sup_{i \ne j} \big|P^d_k\big(w_i^T w_j\big)\big| \le \epsilon\Big\}$$
for $\epsilon \in (0, 1)$.

Lemma 45 It holds that
$$N(d, k, \epsilon) \ge \sup\Big\{N \ge 1 : \exists\, w_1, \dots, w_N \in \mathbb{S}^{d-1} \ \text{s.t.} \ \sup_{i \ne j} \big(w_i^T w_j\big)^2 \le 1 - \frac{d}{k\, \epsilon^{4/d}}\Big\}$$
for $k > d \ge 5$ and $(d/k)^{d/4} \le \epsilon < 1$.

Proof By eq. (2.117) in (Atkinson and Han, 2012), combined with a standard bound on the Gamma function, it holds that
$$\big|P^d_k(t)\big| \le \Big(\frac{d}{k(1 - t^2)}\Big)^{(d-2)/2} \le \Big(\frac{d}{k(1 - t^2)}\Big)^{d/4}$$
for $d \ge 5$, $|t| < 1$ and $k(1 - t^2) \ge d$. In particular, for $\epsilon \in (0, 1)$, it holds that $|P^d_k(t)| \le \epsilon$ if
$$\frac{d}{k(1 - t^2)} \le \epsilon^{4/d}, \quad \text{that is, if} \quad t^2 \le 1 - \frac{d}{k\, \epsilon^{4/d}}.$$
The thesis follows.

Define also
$$N(d, \delta) = \sup\big\{N \ge 1 : \exists\, w_1, \dots, w_N \in \mathbb{S}^{d-1} \ \text{s.t.} \ \sup_{i \ne j} |w_i^T w_j| \le \delta\big\}$$
for $\delta \in (0, 1)$. The previous lemma says that
$$N(d, k, \epsilon) \ge N\Big(d, \sqrt{1 - \frac{d}{k\, \epsilon^{4/d}}}\Big).$$

Example 7 Taking
$$\{w_i\}_{i=1}^N = \big\{d^{-1/2}\epsilon : \epsilon \in \{\pm 1\}^d,\ \epsilon_1 = 1\big\}, \qquad (45)$$
it holds that $N = 2^{d-1}$ and $|w_i^T w_j| \le 1 - \frac{2}{d}$ for $i \ne j$. Taking $\epsilon = 2^{-d}$, it holds that, if $k \ge 8d^2$, then $N(d, k, 2^{-d}) \ge 2^{d-1}$.

Using this fact it is possible to explicitly construct a high-energy sparse function.
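The construction of equation (45) is simple enough to check numerically; the following is a minimal sketch (assuming NumPy, with a small $d$ for illustration; it is not part of the formal argument):

```python
import numpy as np
from itertools import product

d = 8
# Sign vectors eps in {+-1}^d with eps_1 = +1, scaled by d^{-1/2}: 2^(d-1) points.
signs = np.array([(1,) + eps for eps in product((-1, 1), repeat=d - 1)])
W = signs / np.sqrt(d)

G = W @ W.T                                   # Gram matrix of the frame
off_diag = G[~np.eye(len(W), dtype=bool)]
print(f"N = {len(W)} = 2^(d-1)")              # 128
print(f"max |<w_i, w_j>| = {np.abs(off_diag).max():.4f}")
print(f"1 - 2/d          = {1 - 2 / d:.4f}")  # the two values coincide
```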
Lemma 46 Take $k \ge 16 d^2$ even and let
$$\hat P(x) = \beta_d \sum_{i=1}^{2^{d-1}} \big(N^d_k\big)^{1/2}\, P^d_k\big(w_i^T x\big),$$
with $\beta_d = 2\big(2^d + 2\big)^{-1/2}$ and the $w_i$ as in equation (45). Then $\|\hat P\|_2 = \Theta_d(1)$ and $\hat P$ is exponentially spread, that is,
$$\ell_{\infty,2}(\hat P) \le O_d\big(2^{-d/2}\big)\, \sqrt{N^d_k}.$$

Proof It holds that
$$\|\hat P\|_2^2 = \beta_d^2 \Big[2^{d-1} + \sum_{i \ne j} P^d_k\big(w_i^T w_j\big)\Big] \le \frac{2}{2^{d-1} + 1} \Big[2^{d-1} + \big(2^{2(d-1)} - 2^{d-1}\big)\, 2^{-d}\Big] = \frac{2}{2^{d-1} + 1} \Big[2^{d-1} + 2^{d-2} - 2^{-1}\Big] \le 3$$
and
$$\|\hat P\|_2^2 \ge \frac{2}{2^{d-1} + 1} \Big[2^{d-1} - \big(2^{2(d-1)} - 2^{d-1}\big)\, 2^{-d}\Big] = \frac{2}{2^{d-1} + 1} \Big[2^{d-1} - 2^{d-2} + 2^{-1}\Big] \ge 1.$$
On the other hand, it holds that
$$\|\hat P\|_\infty \le \beta_d\, \big(N^d_k\big)^{1/2} \sup_{x \in \mathbb{S}^{d-1}} \sum_{i=1}^{2^{d-1}} \big|P^d_k\big(w_i^T x\big)\big|.$$
By definition of the vectors $\{w_i\}_{i=1}^{2^{d-1}}$, for any $x \in \mathbb{S}^{d-1}$ at most one index $i$ can satisfy $(w_i^T x)^2 \ge 1 - \frac{1}{16d}$, while for the remaining indices the bound from the proof of Lemma 45, together with $k \ge 16 d^2$, gives $|P^d_k(w_i^T x)| \le O_d(2^{-d})$. Therefore
$$\|\hat P\|_\infty \le \beta_d\, \big(N^d_k\big)^{1/2} \big(1 + 2^{d-1}\, O_d(2^{-d})\big) = O_d\big(2^{-d/2}\big)\, \big(N^d_k\big)^{1/2},$$
which, combined with $\|\hat P\|_2 = \Theta_d(1)$, proves the claim.

C.3 Proof of Proposition 21

Assume first that $f \in \mathcal{H}_1$. Then $f = h_\pi$ for some even signed Radon measure $\pi$. Thus
$$\gamma_1(f) = \|\pi\|_1 = \sup_{\phi \in C(\mathbb{S}^{d-1}) :\, \|\phi\|_\infty \le 1} \int_{\mathbb{S}^{d-1}} \phi(w)\, d\pi(w) = \sup_{\phi \in C^\infty_{\mathrm{even}}(\mathbb{S}^{d-1}) :\, \|\phi\|_\infty \le 1} \int_{\mathbb{S}^{d-1}} \phi(w)\, d\pi(w)$$
$$= \sup_{\phi \in C^\infty_{\mathrm{even}} :\, \|\phi\|_\infty \le 1} \int_{\mathbb{S}^{d-1}} T\big(T^{-1}\phi\big)(w)\, d\pi(w) = \sup_{\phi \in C^\infty_{\mathrm{even}} :\, \|\phi\|_\infty \le 1} \int_{\mathbb{S}^{d-1}} \int_{\mathbb{S}^{d-1}} |w^T x|\, (T^{-1}\phi)(x)\, d\tau(x)\, d\pi(w) = \sup_{\phi \in C^\infty_{\mathrm{even}} :\, \|\phi\|_\infty \le 1} \big\langle T^{-1}\phi,\, f \big\rangle.$$
This shows one side of the statement. On the other hand, assume that
$$\sup_{\phi \in C^\infty_{\mathrm{even}}(\mathbb{S}^{d-1}) :\, \|\phi\|_\infty \le 1} \big\langle T^{-1}\phi,\, f \big\rangle < \infty.$$
Then the transformation $S_f(\phi) := \langle T^{-1}\phi, f\rangle$ defines a bounded linear operator $S_f : C^\infty_{\mathrm{even}} \to \mathbb{R}$. Since $C^\infty_{\mathrm{even}}(\mathbb{S}^{d-1})$ is dense in $C_{\mathrm{even}}(\mathbb{S}^{d-1})$ (the set of even functions in $C(\mathbb{S}^{d-1})$), $S_f$ can be extended to a bounded linear operator on $C_{\mathrm{even}}(\mathbb{S}^{d-1})$. By setting $S_f(\phi) = S_f(\phi_{\mathrm{even}})$ we can extend it to $C(\mathbb{S}^{d-1})$. By the Riesz representation theorem, there exists a signed Radon measure $\pi$ on $\mathbb{S}^{d-1}$ such that
$$S_f(\phi) = \int_{\mathbb{S}^{d-1}} \phi(w)\, d\pi(w)$$
for every $\phi \in C(\mathbb{S}^{d-1})$. Moreover, since $S_f(\phi) = 0$ for every odd $\phi$, we can assume that $\pi$ is even. Let $h_\pi$ be the function in $\mathcal{H}_1$ defined by $\pi$. Then it holds that
$$\big\langle T^{-1}\phi,\, f \big\rangle = S_f(\phi) = \big\langle T^{-1}\phi,\, h_\pi \big\rangle$$
for every $\phi \in C^\infty_{\mathrm{even}}(\mathbb{S}^{d-1})$. Since $T$ is an automorphism of $C^\infty_{\mathrm{even}}(\mathbb{S}^{d-1})$, it holds that $\langle \phi, f\rangle = \langle \phi, h_\pi\rangle$ for every $\phi \in C^\infty_{\mathrm{even}}(\mathbb{S}^{d-1})$. Since $f$ and $h_\pi$ are even, this implies that $f = h_\pi$. This concludes the proof.