Journal of Machine Learning Research 26 (2025) 1-51. Submitted 2/24; Revised 6/25; Published 7/25

Density Estimation Using the Perceptron

Patrik Róbert Gerber PRGERBER@MIT.EDU
Department of Mathematics
Massachusetts Institute of Technology
77 Massachusetts Avenue, Cambridge, MA 02139, USA

Tianze Jiang TZJIANG@PRINCETON.EDU
Operations Research and Financial Engineering
Princeton University
98 Charlton St, Princeton, NJ 08540, USA

Yury Polyanskiy YP@MIT.EDU
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
32 Vassar St, Cambridge, MA 02139, USA

Rui Sun ERUISUN@STANFORD.EDU
Department of Statistics
Stanford University
450 Jane Stanford Way, Stanford, CA 94305, USA

Editor: Aapo Hyvärinen

Abstract

We propose a new density estimation algorithm. Given $n$ i.i.d. observations from a distribution belonging to a class of densities on $\mathbb{R}^d$, our estimator outputs any density in the class whose perceptron discrepancy with the empirical distribution is at most $O(\sqrt{d/n})$. The perceptron discrepancy between two distributions is defined as the largest difference in mass that they place on any halfspace. It is shown that this estimator achieves an expected total variation distance to the truth that is almost minimax optimal over the class of densities with bounded Sobolev norm and over Gaussian mixtures. This suggests that the regularity of the true distribution could be an explanation for the efficiency of the ubiquitous step in machine learning that replaces optimization over large function spaces with optimization over simpler parametric classes (such as the discriminators of GANs). We also show that replacing the perceptron discrepancy with the generalized energy distance of Székely and Rizzo (2013) further improves the total variation loss. The generalized energy distance between empirical distributions is easily computable and differentiable, which makes it especially useful for fitting generative models.
To the best of our knowledge, it is the first simple distance with such properties that yields minimax optimal statistical guarantees. In addition, we shed light on the ubiquitous method of representing discrete data on a domain $[k]$ via embedding vectors on a unit ball in $\mathbb{R}^d$. We show that taking $d \asymp \log(k)$ allows one to use simple linear probing to evaluate and estimate the total variation distance, recovering the minimax optimal sample complexity for the class of discrete distributions on $[k]$.

Keywords: perceptron, halfspaces, GAN, density estimation, one-hot encoding, discrete distributions, total variation

©2025 Patrik Róbert Gerber, Tianze Jiang, Yury Polyanskiy, and Rui Sun. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0261.html.

1. Introduction

A standard step in many machine learning algorithms is to replace an (intractable) optimization over a general function space with an optimization over a large parametric class (most often neural networks). This is done in supervised learning for fitting classifiers, in variational inference (Blei et al., 2017; Zhang et al., 2018) for applying the ELBO, in variational autoencoders (Kingma and Welling, 2019) for fitting the decoder, in Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Arjovsky et al., 2017) for fitting the discriminator, in diffusion models (Song et al., 2020; Chen et al., 2022) for fitting the score function, and in many other settings. To be specific, let us focus on the example of GANs, which brought about the new era of density estimation in high-dimensional spaces. The problem setting is the following. We are given access to i.i.d. data $X_1, \dots, X_n \in \mathbb{R}^d$ sampled from an unknown distribution $\nu$, and a class of distributions $\mathcal{G}$ on $\mathbb{R}^d$ (the class of available "generators").
The goal of the learner is to find $\arg\min_{\nu' \in \mathcal{G}} D(\nu', \nu)$, where $D$ is some dissimilarity measure ("metric") between probability distributions. In the case of GANs this measure is the Jensen-Shannon divergence
$$\mathrm{JS}(p, q) \triangleq \tfrac{1}{2}\mathrm{KL}\Big(p \,\Big\|\, \tfrac{p+q}{2}\Big) + \tfrac{1}{2}\mathrm{KL}\Big(q \,\Big\|\, \tfrac{p+q}{2}\Big),$$
where $\mathrm{KL}(p\|q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$ is the Kullback-Leibler divergence. As any $f$-divergence, JS has a variational form (see (Polyanskiy and Wu, 2023+, Example 7.5)):
$$\mathrm{JS}(p, q) = \log 2 + \sup_{h:\mathbb{R}^d \to (0,1)} \big\{ \mathbb{E}_p[\log h] + \mathbb{E}_q[\log(1-h)] \big\}.$$
With this idea in mind, we can now restate the objective of minimizing $\mathrm{JS}(\nu', \nu)$ as a game between a generator $\nu'$ and a discriminator $h$; i.e., the GAN's estimator is
$$\hat\nu \in \arg\min_{\nu' \in \mathcal{G}}\ \sup_{h:\mathbb{R}^d \to (0,1)}\ \frac{1}{n}\sum_{i=1}^{n} \log h(X_i) + \mathbb{E}_{\nu'}[\log(1-h)], \quad (1)$$
where we also replaced the expectation over (the unknown) $\nu$ with its empirical version $\nu_n \triangleq \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$. Subsequently, the idea was extended to other types of metrics, notably the Wasserstein GAN (Arjovsky et al., 2017), which defines
$$\hat\nu \in \arg\min_{\nu' \in \mathcal{G}}\ \sup_{f \in \mathcal{D}}\ \Big\{ \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}_{Y\sim\nu'}[f(Y)] \Big\}, \quad (2)$$
where the set of discriminators $\mathcal{D}$ is a class of Lipschitz functions (corresponding to the variational characterization of the Wasserstein-1 distance). The final step in turning (1) or (2) into an algorithm is to relax the domain of the inner maximization (the "discriminator") to a parametric class of neural network discriminators $\mathcal{D}$. Note that replacing $\sup_{h:\mathbb{R}^d\to(0,1)}$ with $\sup_{h\in\mathcal{D}}$ effectively changes the objective from minimizing the JS divergence to minimizing a "neural-JS", similar to how MINE (Belghazi et al., 2018) replaces the true mutual information with a neural one. This weakening is quite worrisome for a statistician. While the JS divergence is a strong statistical distance, as it bounds total variation from above and from below (Polyanskiy and Wu, 2023+, Eq. (7.39)), the neural-JS is unlikely to possess any such properties. How does one justify this restriction to a simpler class $\mathcal{D}$?
A practitioner would say that while taking $\max_{h \in \mathcal{D}}$ restricts the power of the discriminator, the design of $\mathcal{D}$ is fine-tuned to pick up those features of the distributions that are relevant to the human eye.¹ A theoretician, instead, would appeal to universal approximation results about neural networks to claim that the restriction to $\mathcal{D}$ is almost lossless. The purpose of this paper is to suggest, and prove, a third explanation: the answer is in the regularity of $\nu$ itself. Indeed, we show that the restriction of discriminators to a very small class $\mathcal{D}$ in (1) results in almost no loss of minimax statistical guarantees, even if $\mathcal{D}$ is far from being a universal approximator. That is, the minimizing distribution $\hat\nu$ selected with respect to a weak form of the distance enjoys almost minimax optimal guarantees with respect to the strong total variation distance, provided that the true distribution $\nu$ is regular enough. Phrased yet another way, even though the neural distance is very coarse and imprecise, and hence the minimizer selected with respect to it might be expected to only fool very naive discriminators, in reality it turns out to fool any arbitrarily complex, but bounded, discriminator.

1. Implying, in other words, that whether or not the total variation $\mathrm{TV}(\hat\nu, \nu)$ is high is irrelevant as long as the generated images look good enough to humans.

Let us proceed to a more formal statement of our results. One may consult Section 1.3 for notation. We primarily focus on two classes of distributions on $\mathbb{R}^d$: first, $\mathcal{P}_S(\beta, d, C)$ denotes the set of distributions supported on the $d$-dimensional unit ball $B(0,1)$ that have a density with finite $L^2$ norm and whose $(\beta,2)$-Sobolev norm, defined in (11), is bounded by $C$; second, $\mathcal{P}_G(d) = \{\mu * N(0, I_d) : \mathrm{supp}(\mu) \subseteq B(0,1)\}$ is the class of Gaussian mixtures with compactly supported mixing distribution. We remind the reader that the total variation distance has the variational form $\mathrm{TV}(p,q) = \sup_{h:\mathbb{R}^d\to[0,1]} \mathbb{E}_p h - \mathbb{E}_q h$.
Our first result concerns the following class of discriminators:
$$\mathcal{D}_1 = \big\{x \mapsto \mathbb{1}\{x^\top v \ge b\} : v \in \mathbb{R}^d,\ b \in \mathbb{R}\big\}, \quad (3)$$
the class of affine classifiers, which can be seen as a single-layer perceptron with a threshold nonlinearity.

Theorem 1. For any $\beta > 0$, $d \ge 1$ and $C > 0$, there exists a finite constant $C_1 = C_1(\beta, d, C)$ such that
$$\sup_{\nu \in \mathcal{P}_S(\beta,d,C)} \mathbb{E}\,\mathrm{TV}(\hat\nu, \nu) \le C_1\, n^{-\frac{\beta}{2\beta+d+1}}, \quad (4)$$
where the estimator $\hat\nu$ is defined in (2) with $\mathcal{D} = \mathcal{D}_1$ and $\mathcal{G} = \mathcal{P}_S(\beta,d,C)$. Similarly, for any $d \ge 1$ there exists a finite constant $C_2 = C_2(d)$ so that
$$\sup_{\nu \in \mathcal{P}_G(d)} \mathbb{E}\,\mathrm{TV}(\hat\nu, \nu) \le C_2\, \frac{(\log n)^{\frac{2d+2}{4}}}{\sqrt{n}},$$
where the estimator $\hat\nu$ is defined in (2) with $\mathcal{D} = \mathcal{D}_1$ and $\mathcal{G} = \mathcal{P}_G(d)$.

Recall the classical result (Ibragimov and Khasminskii, 1983) showing that the minimax optimal estimation rate in TV over the class $\mathcal{P}_S(\beta,d,C)$ equals $n^{-\beta/(2\beta+d)}$ up to constant factors. Thus, the estimator in (4) is almost optimal, the only difference being that the dimension $d$ is replaced by $d+1$. Similarly, for Gaussian mixtures we reach the parametric rate up to a polylog factor.²

2. For estimation of Gaussian mixtures in total variation the precise value of the minimax optimal polylog factor is at present unknown. However, for the $L^2$ distance the minimax rate is known, and in the course of our proofs (see (24)) we show that our estimator only loses a multiplicative factor of $\log(n)^{1/4}$ compared to the optimal $L^2$-rate $\log(n)^{d/4}/\sqrt{n}$ derived in (Kim and Guntuboyina, 2022).

The proof of Theorem 1 relies on a comparison inequality between total variation and the "perceptron discrepancy", or maximum halfspace distance, which we define as
$$d_H(\mu,\nu) \triangleq \sup_{f \in \mathcal{D}_1} \{\mathbb{E}_\mu f - \mathbb{E}_\nu f\}. \quad (5)$$
Note first that $d_H \le \mathrm{TV}$ clearly holds, since all functions in the class $\mathcal{D}_1$ are bounded by 1. For the other direction, by proving a generalization of the Gagliardo-Nirenberg-Sobolev inequality we derive the following comparisons.

Theorem 2.
For any $\beta > 0$, $d \ge 1$, there exists a finite constant $C_1 = C_1(\beta, d)$ such that
$$\mathrm{TV}(\mu,\nu)^{\frac{2\beta+d+1}{2\beta}} \le C_1 \big( \|\mu\|_{\beta,2}^2 + \|\nu\|_{\beta,2}^2 \big)^{\frac{d+1}{4\beta}}\, d_H(\mu,\nu) \quad (6)$$
holds for all $\mu,\nu \in \mathcal{P}_S(\beta,d,\infty)$. Similarly, for any $d \ge 1$ there exists a finite constant $C_2 = C_2(d)$ such that
$$\frac{\mathrm{TV}(\mu,\nu)}{\log\big(3 + \frac{1}{\mathrm{TV}(\mu,\nu)}\big)^{\frac{2d+2}{4}}} \le C_2\, d_H(\mu,\nu)$$
holds for all $\mu,\nu \in \mathcal{P}_G(d)$.

We remark that we also show (in Proposition 12) that the exponent $\frac{2\beta+d+1}{2\beta}$ in (6) is tight, i.e., it cannot be improved in general. With Theorem 2 in hand, the proof of Theorem 1 is notably simple. For example, let us prove (4) (for full details, see Section 4.2). Recall that $X_i \overset{\text{i.i.d.}}{\sim} \nu$, that $\nu_n$ is the empirical distribution, and that $\hat\nu = \arg\min_{\nu' \in \mathcal{P}_S} d_H(\nu', \nu_n)$. We then have, from the triangle inequality and the minimality of $\hat\nu$:
$$d_H(\hat\nu, \nu) \le d_H(\hat\nu, \nu_n) + d_H(\nu_n, \nu) \le 2\, d_H(\nu_n, \nu).$$
Thus, from Theorem 2 we have
$$\mathrm{TV}(\hat\nu, \nu) \lesssim d_H(\nu_n, \nu)^{\frac{2\beta}{2\beta+d+1}}. \quad (7)$$
Lastly, we recall that $\mathcal{D}_1$ is a class with finite VC-dimension, and thus from uniform convergence (Theorem 8.3.23, Vershynin (2018)) we have, for some dimension-dependent constant $C(d)$, that $\mathbb{E}[d_H(\nu_n,\nu)] \le \frac{C(d)}{\sqrt{n}}$. Thus, applying expectation and Jensen's inequality to (7) we get
$$\mathbb{E}[\mathrm{TV}(\hat\nu,\nu)] \lesssim \mathbb{E}\big[d_H(\nu_n,\nu)^{\frac{2\beta}{2\beta+d+1}}\big] \le \mathbb{E}[d_H(\nu_n,\nu)]^{\frac{2\beta}{2\beta+d+1}} \lesssim n^{-\frac{\beta}{2\beta+d+1}}$$
as claimed. While we believe that Theorem 1 provides theoretical support for the efficacy of simple discriminators, it has several theoretical and practical deficiencies that we need to address. First, the guarantee in Theorem 1 for $\mathcal{P}_S$ is strictly worse than the minimax optimal rate, which is $O\big(n^{-\frac{\beta}{2\beta+d}}\big)$ (see e.g. Ibragimov and Khasminskii (1983)). Second, from the implementation point of view, computing the distance $d_H$ behind Theorem 1 is impractical.
Indeed, finding the halfspace with maximal separation between even two empirical measures is a nonconvex, non-differentiable problem: it takes super-polynomial time in the dimension $d$ assuming $\mathrm{P} \ne \mathrm{NP}$ (Guruswami and Raghavendra, 2009), and $d^{\omega(\varepsilon^{-1})}$ time for $\varepsilon$-optimal agnostic learning between two densities assuming hardness of either SIVP or gapSVP (Tiegel, 2023). Finally, even if we disregard the computational complexity, it is unclear how to solve $\arg\min_{\nu'} d_H(\nu', \nu_n)$ over the class for given samples. This concern is alleviated by the fact that any $\hat\nu$ satisfying $d_H(\hat\nu, \nu_n) = O(\sqrt{d/n})$ works without degrading our performance guarantee, and thus only an approximate minimizer is needed. To address the above concerns, we make two changes to improve $d_H$ in the min-distance density estimator: (a) we replace the perceptron class $\mathcal{D}_1$ in (3) with a generalized perceptron class $\mathcal{D}_\gamma$ for $\gamma \in (0,2)\setminus\{1\}$, defined as
$$\mathcal{D}_\gamma = \big\{x \mapsto |x^\top v - b|^{\frac{\gamma-1}{2}} : v \in \mathbb{R}^d,\ b \in \mathbb{R}\big\}, \quad \gamma \in (0,2)\setminus\{1\}; \quad (8)$$
(b) we replace the perceptron discrepancy $d_H$ (defined with respect to the best perceptron) with an "average" version $\bar d_H$ defined in (13) (see (18) for the definition with general $\gamma$). Therefore, one does not even need to find an approximately optimal halfspace, as random halfspaces provide sufficient discriminatory power. These changes are made precise in Section 2. Our Theorem 18 and Corollary 19 show that these two changes allow us to achieve a total variation rate of $n^{-\beta/(2\beta+d+\gamma)}$ for the min-distance density estimator. The improved rate comes to within $\mathrm{polylog}(n)$ of minimax optimality as $\gamma \to 0$ adaptively with $n$, addressing our first concern. Somewhat unexpectedly, we discover that the average perceptron discrepancy $\bar d_H$ exactly equals (up to an explicit dimensional constant) Székely and Rizzo's energy distance $\mathcal{E}_1$ (Definition 1, Székely and Rizzo (2013)), defined as
$$\mathcal{E}_1^2(\mu,\nu) \triangleq \mathbb{E}\big[2\|X-Y\| - \|X-X'\| - \|Y-Y'\|\big], \quad (X, X', Y, Y') \sim \mu^{\otimes2} \otimes \nu^{\otimes2}, \quad (9)$$
where $\|\cdot\|$ is the usual Euclidean norm on $\mathbb{R}^d$.
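For empirical measures, the expectation in (9) reduces to pairwise distances, so the squared sample energy distance costs only quadratically many operations. The following minimal NumPy sketch (the function name and the `gamma` parameter are ours, for illustration; `gamma=1` recovers (9)) computes it directly:

```python
import numpy as np

def energy_distance_sq(x, y, gamma=1.0):
    """Squared (generalized) energy distance between two empirical samples
    x of shape (n, d) and y of shape (m, d), via the pairwise formula (9):
    2 E||X-Y||^gamma - E||X-X'||^gamma - E||Y-Y'||^gamma."""
    def mean_pow_dist(a, b):
        # mean of ||a_i - b_j||^gamma over all pairs (i, j)
        diff = a[:, None, :] - b[None, :, :]
        return np.mean(np.linalg.norm(diff, axis=-1) ** gamma)
    return 2 * mean_pow_dist(x, y) - mean_pow_dist(x, x) - mean_pow_dist(y, y)
```

Because this is a smooth function of the sample locations (away from coincident points), automatic differentiation frameworks can backpropagate through it, which is what makes it convenient as a GAN-type training loss.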
Thus, our Theorem 18 (with $\gamma = 1$) shows that minimizing $\min_{\nu'} \mathcal{E}_1(\nu', \nu_n)$ gives a density estimator with the rates over $\mathcal{P}_S$ and $\mathcal{P}_G$ given in Theorem 1. Furthermore, for $\gamma \ne 1$ the corresponding average over $\mathcal{D}_\gamma$ results in the distance $\mathcal{E}_\gamma$, known as the generalized energy distance, defined in the same paper. See Section 2 for details. This discovery addresses our second and final concern in the following sense. Treating the distance $\mathcal{E}_\gamma(\tilde\nu^{\mathrm{gen.}}_m, \nu^{\mathrm{target}}_n)$ as a loss function for fitting the generator $\tilde\nu^{\mathrm{gen.}}_m$, the closed-form computation of $\mathcal{E}_\gamma$ via (9) requires only $O(n^2 + m^2)$ steps and is friendly to gradient evaluations. Overall, our message from an algorithmic point of view is as follows: assuming one has access to a parametric family of generators sampling from $\nu_\theta$ for parameters $\theta \in \mathbb{R}^p$, and if one can compute gradients $\nabla_\theta$ of the generator forward pass, e.g., via the pushforward of a reference distribution under a smooth transport map or neural-network-based models (Wang and Marzouk (2022); Marzouk et al. (2023)), then one can fit $\theta$ to the empirical sample $\nu_n$ by running stochastic gradient descent steps: (i) sample $m$ points from $\nu_\theta$ and form the empirical distribution $\tilde\nu_m$; (ii) compute the loss $\mathcal{E}_\gamma(\tilde\nu_m, \nu_n)$ and backpropagate the gradient with respect to $\theta$; (iii) update $\theta \leftarrow \theta - \eta\, \nabla_\theta \mathcal{E}_\gamma(\tilde\nu_m, \nu_n)$ for some step size $\eta$. Again, computing $\mathcal{E}_\gamma(\tilde\nu_m, \nu_n)$ via (9) requires $O(n^2 + m^2)$ steps and is gradient-friendly. Note, though, that obtaining and certifying good parameterizations for generators with densities satisfying Sobolev norm constraints is nontrivial and may require careful regularization. In this work, however, our focus is on the discriminator part of GANs, as opposed to the generator.

1.1 Contributions

This work focuses on studying discriminator classes for adversarially estimating smooth densities; our main contributions are as follows.
- We show that $\beta$-smooth distributions, Gaussian mixtures and discrete distributions that are far apart in total variation distance must possess a halfspace on which their masses differ substantially (Theorems 2, 11 and 14).
- We apply these separation results to density estimation, showing that an ERM density estimator nearly attains the minimax optimal density estimation rate with respect to TV over the aforementioned distribution classes (Theorems 1, 18 and 21). This suggests that halfspace indicators are (almost) sufficient discriminator classes in the GAN objective (2) for achieving (close to) minimax optimal density estimation over the considered classes of distributions.
- In Section 2 we show that the average halfspace separation distance $\bar d_H$ is equal, up to a constant, to the energy distance $\mathcal{E}_1$ (Theorem 4), which has many equivalent expressions: as a weighted $L^2$-distance between characteristic functions (Proposition 5), as the sliced Cramér-2 distance (Theorem 8), as an IPM/MMD/energy distance (Section 2.3), and as the $L^2$-norm of the Riesz potential (Proposition 9).
- We generalize the average halfspace distance $\bar d_H$ to include an exponent $\gamma \in (0,2)$, corresponding to the generalized energy distance $\mathcal{E}_\gamma$. Consequently, we discover that if instead of the thresholded linear features $\mathbb{1}\{v^\top x > b\}$ we use the nonlinearity $|v^\top x - b|^{\frac{\gamma-1}{2}}$, smooth distributions and Gaussian mixtures can be separated even better (Theorem 11). Combined with the fact that $\mathcal{E}_\gamma$, similarly to $\bar d_H$, decays between population and sample measures at the parametric rate (Lemma 7), the ERM for $\mathcal{E}_\gamma$ reduces the slack in the density estimation rate, achieving minimax log-optimality (Corollary 19). This result, combined with its strong approximation properties, supports its use in modern generative models (e.g. Ho et al. (2020); Goodfellow et al. (2014); Rombach et al. (2022); Ramesh et al.).
- Finally, Proposition 24 shows that recent work applying $\bar d_H$ to two-sample testing is suboptimal over the class of smooth distributions in the minimax sense.

1.2 Related Work

An important concept related to our work is that of Integral Probability Metrics (IPMs),
$$d^{\mathrm{IPM}}_{\mathcal{D}}(P, Q) \triangleq \sup_{f \in \mathcal{D}} |\mathbb{E}_P f - \mathbb{E}_Q f|, \quad (10)$$
where $\mathcal{D}$ is the discriminator class. Examples include TV, when $\mathcal{D}$ is the class of bounded functions, and $W_1$, when $\mathcal{D}$ is the class of Lipschitz functions. IPM distances represent the maximal separation between two distributions achievable by an adversary powered with $\mathcal{D}$, and can be viewed as the generator objective in the GAN (2). In terms of distance comparison inequalities on smooth density classes similar to our Theorem 2, the closest work we found is Chae and Walker (2020), in which the authors show the comparison $\mathrm{TV} \lesssim W_1^{\beta/(\beta+1)}$ for $(L^1,\beta)$-Sobolev smooth densities on $\mathbb{R}$. This inequality is also optimal in the exponent. Subsequently, Chae (2024) extended the comparison inequality to Besov densities on $\mathbb{R}^d$ and to $W_p$ metrics. These results³ serve the same purpose as ours: they show that achieving optimal estimation in a weak norm (such as $W_1$) implies an optimal estimation rate under a strong norm.

3. We thank anonymous reviewers for pointing them out to us.

Since $W_1$ corresponds to a non-parametric discriminator class of all Lipschitz functions, this result does not explain why simple discriminators work so well in practice. Another paper related to distance comparison is Bai et al. (2018), whose authors study the comparison between the Wasserstein distance $W_1$ and the IPM $d_{\mathrm{relu}}$ defined by the discriminator class $\mathcal{D} = \{x \mapsto \mathrm{Relu}(x^\top v + b) : |b| \le 1, \|v\| \le 1\}$. They show (Bai et al., 2018, Theorem 3.1) that $\sqrt{\kappa/d}\; W_1 \lesssim d_{\mathrm{relu}} \le W_1$ for Gaussian distributions with mean in the unit ball, where $\kappa$ is an upper bound on their condition numbers and $d$ is the dimension.
They obtain results for other distribution classes (Gaussian mixtures, exponential families), but for each of these they use a different class of discriminators, adapted to the problem. In the density estimation literature, an estimator of the form (2) applied to smooth densities first appeared in the famous work of Yatracos (1985). Instead of indicators of halfspaces, they consider the class of discriminators $\mathcal{Y}_\epsilon \triangleq \{\mathbb{1}\{d\nu_i/d\nu_j \ge 1\} : 1 \le i, j \le N(\epsilon, \mathcal{G})\}$, where $\nu_1, \dots, \nu_{N(\epsilon,\mathcal{G})}$ forms a minimal $\epsilon$-TV covering of the class $\mathcal{G}$ and $N(\epsilon,\mathcal{G})$ is the so-called covering number. Writing $d_{\mathcal{Y}}(\mu, \mu') = \sup_{f \in \mathcal{Y}} (\mathbb{E}_\mu f - \mathbb{E}_{\mu'} f)$, it is not hard to prove that $|\mathrm{TV} - d_{\mathcal{Y}}| = O(\epsilon)$ on $\mathcal{G} \times \mathcal{G}$ and that $\mathbb{E}\, d_{\mathcal{Y}}(\nu, \nu_n) \lesssim \sqrt{\log N(\epsilon, \mathcal{G})/n}$ by a union bound coupled with a binomial tail inequality. From here $\mathbb{E}\,\mathrm{TV}(\hat\nu, \nu) \lesssim \sqrt{\log N(\epsilon,\mathcal{G})/n} + \epsilon$ follows by the triangle inequality (here $\hat\nu$ is defined as in (2) with $\mathcal{D} = \mathcal{Y}_\epsilon$). Note that, in contrast to our perceptron discrepancy $d_H$, Yatracos' estimator attains the optimal rate on $\mathcal{G} = \mathcal{P}_S$, corresponding to the choice $\epsilon = \epsilon(n) \asymp n^{-\beta/(2\beta+d)}$. In terms of GAN (or, more generally, adversarial) density estimation, a series of papers (Singh et al. (2018), Uppal et al. (2019), Chen et al. (2020), Belomestny et al. (2021), Liang (2021), Chae (2022), Stéphanovitch et al. (2023), etc.) study the problem of density estimation over smoothness classes with respect to IPM distances. These works typically choose the discriminator class to be a large neural network, of size growing with $n$ and $\beta$, giving a universal approximation of some non-parametric discriminator class dependent on $\beta$. Given such a fine approximation of the discriminator class, one obtains IPM error bounds (on e.g. $W_1$), which are then converted to TV bounds via comparison inequalities. In this paper, we stress again, the discriminator class is extremely small and weak (a collection of halfspaces) and does not depend on the smoothness $\beta$ or the number of samples $n$. We provide a more detailed literature review and discussion in Section A.
Another paper, by Oko et al. (2023), derives minimax density estimation guarantees for a class of diffusion-based estimators. Their result is similar to ours in that it obtains rigorous (near-)optimal guarantees for a method that is a realistic model of what is currently done in practice. However, their analysis also relies crucially on the universal approximation property of neural networks for the score function. In the present work, we mainly focus on the fixed and interpretable discriminator class $\{x \mapsto f(x^\top v - b) : \|v\| \le 1, b \in \mathbb{R}\}$ for a set of prescribed $f$, and derive comparisons to TV for smooth distributions, Gaussian mixtures, and discrete distributions. In addition, we prove the (near-)optimality of our results (for smooth densities) and also derive nonparametric estimation rates for the corresponding GAN density estimators. Our work is, to the best of our knowledge, the first that satisfies the following:

1. It uses a vanilla GAN-type approach (2) with ERM (which is close to practice) and obtains total variation rates that are (close to) minimax optimal.

2. The discriminator class is oblivious to the smoothness parameters: parts of our results are also $n$-oblivious, and the ones that depend on $n$ only concern the activation function. Our proposed discriminator class is also very simple: we use a composition of a single nonlinearity and a linear function (i.e., it is a VC class with dimension $d+1$).

The closest works to the above objectives are Belomestny et al. (2021) and Stéphanovitch et al. (2023), model 3, both of which derived the log-optimal TV risk $\tilde O\big(n^{-\frac{\beta}{2\beta+d}}\big)$ via (2). However, they both rely on the universal approximation property of neural networks, which have to be larger than some function of $\beta$, the smoothness parameter, as well as $n$, the sample size.
Hence another message, in contrast to the above works on discriminator networks, is that with an extremely simple class of discriminators parametrized by halfspaces (even with finite VC dimension) and without the need for training/optimization, one can get almost optimal density estimation rates by averaging random halfspace distances (Theorem 18). Moreover, if any generator fits the empirical distribution to within $O(1/\sqrt{n})$ in our distance (a rate that is oblivious to $\beta$), we get the corresponding TV guarantee with high probability (Proposition 23). Independently of this work, recent results by Paik et al. (2023) investigate the halfspace separability of distributions in the setting of two-sample testing. However, their focus is on the asymptotic power of the test as the number of samples grows to infinity. Our lower bound construction presented in Section E proves that their proposed test is suboptimal in the minimax setting. See Section 5 for a more detailed discussion.

1.3 Notation

The symbols $O, o, \Theta, \Omega, \omega$ follow the conventional big-O notation, and $\tilde O, \tilde o$ hide polylogarithmic factors. We use $\lesssim$, $\gtrsim$ and $\asymp$ throughout our calculations to hide multiplicative constants that are irrelevant (depending on the context). Given a vector $x \in \mathbb{R}^d$, we write $\|x\|$ for its Euclidean norm and $\langle x, y \rangle \triangleq x^\top y$ for the Euclidean inner product of $x, y \in \mathbb{R}^d$. The Gamma function is denoted by $\Gamma$. We write $B(x,r) \triangleq \{y \in \mathbb{R}^d : \|x-y\| \le r\}$, $S^{d-1} \triangleq \{x \in \mathbb{R}^d : \|x\| = 1\}$, and $\sigma$ for the unnormalized surface measure on $S^{d-1}$. The surface area of the unit $(d-1)$-sphere is also written as $\sigma(S^{d-1}) = 2\pi^{d/2}/\Gamma(\frac{d}{2})$. In particular, if $X$ is a random vector uniformly distributed on $S^{d-1}$, then for any $h$ we have $\mathbb{E}[h(X)] = \frac{1}{\sigma(S^{d-1})} \int_{S^{d-1}} h(y)\, d\sigma(y)$. The convolution between functions/measures is denoted by $*$. We write $L^p(\mathbb{R}^d)$ for the space of (equivalence classes of) functions $\mathbb{R}^d \to \mathbb{C}$ that satisfy $\|f\|_p \triangleq \big(\int_{\mathbb{R}^d} |f(x)|^p\, dx\big)^{1/p} < \infty$. The space of all probability distributions on $\mathbb{R}^d$ is denoted by $\mathcal{P}(\mathbb{R}^d)$. For a signed measure $\nu$ we write $\mathrm{supp}(\nu)$ for its support and $M_r(\nu) \triangleq \int \|x\|^r\, d|\nu|(x)$ for its $r$-th absolute moment.
Given $P, Q \in \mathcal{P}(\mathbb{R}^d)$ we write $\mathrm{TV}(P,Q) \triangleq \sup_{A \subseteq \mathbb{R}^d} [P(A) - Q(A)]$ for the total variation distance, where the supremum is over all measurable sets. Given a function $f \in L^1(\mathbb{R}^d)$, define its Fourier transform as $\hat f(\omega) \triangleq \mathcal{F}[f](\omega) \triangleq \int_{\mathbb{R}^d} e^{-i\langle x, \omega\rangle} f(x)\, dx$. Given a finite signed measure $\nu$ on $\mathbb{R}^d$, define its Fourier transform as $\mathcal{F}[\nu](\omega) \triangleq \int_{\mathbb{R}^d} e^{-i\langle \omega, x\rangle}\, d\nu(x)$. We extend the Fourier transform to $L^2(\mathbb{R}^d)$ and tempered distributions in the standard manner. Given $f \in L^2(\mathbb{R}^d)$ and $\beta > 0$, define its homogeneous Sobolev seminorm of order $(\beta, 2)$ as
$$\|f\|_{\beta,2} \triangleq \Big( \int_{\mathbb{R}^d} \|\omega\|^{2\beta}\, |\hat f(\omega)|^2\, d\omega \Big)^{1/2}. \quad (11)$$
Further, we define two specific classes of distributions of interest as follows: $\mathcal{P}_S(\beta,d,C)$ is a set of smooth densities, while $\mathcal{P}_G(d)$ is the set of all Gaussian mixtures with mixing distribution supported in the unit ball; formally,
$$\mathcal{P}_S(\beta,d,C) \triangleq \{\mu \in \mathcal{P}(\mathbb{R}^d) : \mathrm{supp}(\mu) \subseteq B(0,1),\ \mu \text{ has density } p \text{ with } \|p\|_{\beta,2} < C\},$$
$$\mathcal{P}_G(d) \triangleq \{\nu * N(0, I_d) : \nu \in \mathcal{P}(\mathbb{R}^d),\ \mathrm{supp}(\nu) \subseteq B(0,1)\}.$$

Assumption 1. Throughout the paper we assume that $C$ in the definition of $\mathcal{P}_S(\beta,d,C)$ is large enough relative to $\beta$ and $d$, such that $\mathcal{P}_S(\beta,d,C/2)$ is non-empty.

1.4 Structure

The structure of the paper is as follows. In Section 2 we introduce the generalized energy distance, the main object of our study. We show how it relates to the perceptron discrepancy $d_H$ and its relaxation $\bar d_H$, and we record equivalent formulations of the generalized energy distance, one of which is a novel sliced-distance form. In Section 3 we present our main technical results on comparison inequalities between total variation and the energy distance. In Section 4 we analyse the density estimator that minimizes the empirical energy distance, and prove Theorem 1 and Theorem 2 in Section 4.2. In Section 5 we show that the use of $\bar d_H$ for two-sample testing results in suboptimal performance. We conclude in Section 6. All omitted proofs and auxiliary results are deferred to the Appendix.

2.
The Generalized Energy Distance

Given two probability distributions $\mu, \nu$ on $\mathbb{R}^d$ with finite $\gamma$-th moment, the generalized energy distance of order $\gamma \in (0,2)$ between them is defined as
$$\mathcal{E}_\gamma^2(\mu,\nu) = \mathbb{E}\big[2\|X-Y\|^\gamma - \|X-X'\|^\gamma - \|Y-Y'\|^\gamma\big], \quad (X,X',Y,Y') \sim \mu^{\otimes2} \otimes \nu^{\otimes2}. \quad (12)$$
As we alluded to in the introduction, the proof of Theorems 1 and 2 becomes possible once we relax the supremum in the definition of $d_H$ to an unnormalized average over halfspaces. In Section 2.1 we discuss this relaxation in more detail and identify a connection to the energy distance $\mathcal{E}_1$ defined above in (12). Motivated by this, we study the (generalized) energy distance and give multiple equivalent characterizations of it in Sections 2.1 through 2.5.

2.1 From Perceptron Discrepancy to Energy Distance

Our first goal is to connect the study of $d_H$ to the study of $\mathcal{E}_\gamma$ with $\gamma = 1$. To achieve this, we introduce an intermediary, the "average" perceptron discrepancy $\bar d_H$. Given two probability distributions $\mu, \nu$ on $\mathbb{R}^d$, we define
$$\bar d_H^2(\mu,\nu) \triangleq \int_{S^{d-1}} \int_{\mathbb{R}} \Big( \int \mathbb{1}\{\langle v, x\rangle \ge b\}\, \big(d\mu(x) - d\nu(x)\big) \Big)^2 db\, d\sigma(v), \quad (13)$$
where $\sigma$ denotes the surface area measure. If the two distributions $\mu, \nu$ are supported on a compact set, then this definition can indeed be regarded as a mean-squared version of the perceptron discrepancy, because the integrals over $b$ and $v$ effectively range over bounded sets. In general, however, the integral over $b$ in the definition of $\bar d_H$ is not normalizable, and that is why we put "average" in quotes. Nevertheless, we have the following comparisons between $d_H$ and $\bar d_H$.

Proposition 3. For any $\beta > 0$, $d \ge 1$, $C > 0$, and for all $\mu,\nu \in \mathcal{P}_S(\beta,d,\infty)$, we have
$$\sqrt{\frac{\Gamma(d/2)}{4\pi^{d/2}}}\; \bar d_H(\mu,\nu) \le d_H(\mu,\nu). \quad (14)$$
Moreover, for all $d \ge 1$, there exists a finite constant $C_1 = C_1(d)$ such that for all $\mu,\nu \in \mathcal{P}_G(d)$,
$$\frac{\bar d_H(\mu,\nu)}{\log\big(3 + 1/\bar d_H(\mu,\nu)\big)^{1/4}} \le C_1\, d_H(\mu,\nu).$$

Proof. The proof of (14) is immediate after noting that all distributions in $\mathcal{P}_S(\beta,d,\infty)$ are supported on the $d$-dimensional unit ball and that $\int_{S^{d-1}} \int_{-1}^1 db\, d\sigma(v) = 4\pi^{d/2}/\Gamma(d/2)$.
Thus, we focus on the Gaussian mixture case. Write $\mu - \nu = \tau * \varphi$, where $\varphi$ denotes the density of the standard Gaussian $N(0, I_d)$ and $\tau$ is the (signed) difference of the two implicit mixing measures. For any $R > 0$, we have
$$d_H(\mu,\nu) \ge \sup_{v \in S^{d-1},\, |b| \le R} \Big| \int \mathbb{1}\{\langle x,v\rangle \ge b\}\, (\tau * \varphi)(x)\, dx \Big| \ge \sqrt{ \frac{1}{2R\, \mathrm{vol}_{d-1}(S^{d-1})} \int_{S^{d-1}} \int_{-R}^{R} \Big( \int \mathbb{1}\{\langle x,v\rangle \ge b\}\, (\tau * \varphi)(x)\, dx \Big)^2 db\, d\sigma(v) }.$$
Now, since $\tau$ is supported on a subset of $B(0,1)$ by the definition of the class $\mathcal{P}_G(d)$, for any $v \in S^{d-1}$ and $|b| \ge R \ge 2$ we have the bound
$$\Big| \int_{\langle x,v\rangle \ge b} \int_{\mathbb{R}^d} \varphi(x-y)\, d\tau(y)\, dx \Big| \lesssim \int_{\langle x,v\rangle \ge |b|} \exp\big(-(\|x\|-1)^2/2\big)\, dx \lesssim \int_{\|x\| \ge |b|} \exp\big(-\|x\|^2/8\big)\, dx \le \exp(-\Omega(R^2)),$$
where we implicitly used that $\int d\tau = 0$, as $\tau$ is the difference of two probability distributions. Hence the contribution to $\bar d_H^2$ from the region $|b| > R$ is at most $\exp(-\Omega(R^2))$, and choosing $R \asymp \sqrt{\log(3 + 1/\bar d_H(\mu,\nu))}$ concludes the proof.

Proposition 3 implies that to obtain a comparison between TV and $d_H$, specifically for lower bounding $d_H$, it suffices to consider the relaxation $\bar d_H$ instead. The next observation we make is that $\bar d_H$ is in fact equal, up to a constant, to the energy distance.

Theorem 4. Let $\mu, \nu$ be probability distributions on $\mathbb{R}^d$ with finite mean. Then
$$\bar d_H(\mu,\nu) = \frac{\pi^{(d-1)/4}}{\sqrt{\Gamma\big(\frac{d+1}{2}\big)}}\; \mathcal{E}_1(\mu,\nu).$$

Proof. This is a direct implication of (15) and (18). We defer the proof to Theorem 8, which is its direct generalization: the interpretation of $\mathcal{E}_\gamma$ as the average $\mathcal{D}_\gamma$ distance for all $\gamma \in (0,2)$.

As will be clear from the rest of the paper, it does pay off to study $\mathcal{E}_\gamma$ for general $\gamma$, even though so far we have only justified its relevance to the results stated in the introduction for the case $\gamma = 1$. With this in mind, we proceed to study various properties of the generalized energy distances $\{\mathcal{E}_\gamma\}_{\gamma \in (0,2)}$.

2.2 The Fourier Form

The formulation of the generalized energy distance that we rely on most heavily in our proofs is the following.

Proposition 5 ((Székely and Rizzo, 2013, Proposition 2)). Let $\gamma \in (0,2)$ and let $\mu, \nu$ be probability distributions on $\mathbb{R}^d$ with finite $\gamma$-th moment. Then
$$\mathcal{E}_\gamma^2(\mu,\nu) = F_\gamma(d) \int_{\mathbb{R}^d} \frac{|\hat\mu(\omega) - \hat\nu(\omega)|^2}{\|\omega\|^{d+\gamma}}\, d\omega, \quad (15)$$
where we define $F_\gamma(d) = \frac{\gamma\, 2^{\gamma-1}\, \Gamma(\frac{d+\gamma}{2})}{\pi^{d/2}\, \Gamma(1 - \frac{\gamma}{2})}$.

Remark 6.
Note that $F_\gamma(d) = \Theta\big(\gamma(2-\gamma)\, \Gamma(\frac{d+\gamma}{2})\, \pi^{-d/2}\big)$ up to a universal constant. This shows that the generalized energy distance is a weighted $L^2$ distance in Fourier space. The fact that $\mathcal{E}_\gamma$ is a valid metric on probability distributions with finite $\gamma$-th moment is a simple consequence of Proposition 5.

2.3 The MMD and IPM Forms

Another interpretation of the generalized energy distance is through the theory of Maximum Mean Discrepancy (MMD). Given a set $\mathcal{X}$ and a positive semidefinite kernel $k : \mathcal{X}^2 \to \mathbb{R}$, there is a unique reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ consisting of the closure of the linear span of $\{k(x, \cdot) : x \in \mathcal{X}\}$ with respect to the inner product $\langle k(x,\cdot), k(y,\cdot)\rangle_{\mathcal{H}_k} = k(x,y)$. For a probability distribution $\mu$ on $\mathcal{X}$, define its kernel embedding as $\theta_\mu = \int_{\mathcal{X}} k(x, \cdot)\, d\mu(x)$. As shown in (Muandet et al., 2017, Lemma 3.1), the kernel embedding $\theta_\mu$ exists and belongs to the RKHS $\mathcal{H}_k$ if $\mathbb{E}[\sqrt{k(X,X')}] < \infty$ for $(X,X') \sim \mu^{\otimes2}$, as is the case for our kernel defined later in Equation (16). Then, given two probability distributions $\mu$ and $\nu$, the MMD measures their distance in the RKHS: $\mathrm{MMD}_k(\mu,\nu) \triangleq \|\theta_\mu - \theta_\nu\|_{\mathcal{H}_k}$. We refer the reader to Schölkopf and Smola (2001); Muandet et al. (2017) for more details on the underlying theory. MMD has a closed form thanks to the reproducing property:
$$\mathrm{MMD}_k^2(\mu,\nu) = \mathbb{E}\big[k(X,X') + k(Y,Y') - 2k(X,Y)\big], \quad (X,X',Y,Y') \sim \mu^{\otimes2} \otimes \nu^{\otimes2}.$$
Moreover, it also follows that MMD is an Integral Probability Metric (IPM) where the supremum is over the unit ball of the RKHS $\mathcal{H}_k$:
$$\mathrm{MMD}_k(\mu,\nu) = \sup_{f \in \mathcal{H}_k :\, \|f\|_{\mathcal{H}_k} \le 1} \mathbb{E}[f(X) - f(Y)].$$
In our case, we can define the kernel
$$k_\gamma(x,y) = \|x\|^\gamma + \|y\|^\gamma - \|x-y\|^\gamma \quad (16)$$
for $\gamma \in (0,2)$, which in one dimension corresponds to the covariance operator of fractional Brownian motion. For a proof of the nontrivial fact that $k_\gamma$ above is positive definite, see for example Sejdinovic et al. (2013). With the choice of $k_\gamma$, it follows trivially from its definition that the generalized energy distance $\mathcal{E}_\gamma$ is equal to the MMD with kernel $k_\gamma$, i.e.
$E_\gamma(\mu, \nu) = \mathrm{MMD}_{k_\gamma}(\mu, \nu)$ for all distributions $\mu, \nu$ with finite $\gamma$-th moment. It is noteworthy that while $d_H$ is by definition an IPM, so is its averaged version $\bar{d}_H$. A straightforward consequence of the above characterization is the fact that $E_\gamma$ decays at the parametric rate between empirical and population measures. This is not terribly surprising, as analogous results hold for arbitrary MMDs with bounded kernel, see for example (Gretton et al., 2012, Theorem 7). Recall that $M_t(\nu)$ denotes the $t$-th absolute moment of the measure $\nu$.

Lemma 7. Let $\nu$ be a probability distribution on $\mathbb{R}^d$ and let $\nu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ for an i.i.d. sample $X_1, \ldots, X_n$ from $\nu$. Then, for any $\gamma \in (0, 2)$,
$$\mathbb{E}\, E_\gamma^2(\nu, \nu_n) \leq \frac{2^{\max\{1, \gamma\}} M_\gamma(\nu)}{n}.$$
For a high-probability bound when $\nu$ is compactly supported, see Lemma 22.

Proof Let $\tilde{X}_1, \ldots, \tilde{X}_n$ be an additional i.i.d. sample from $\nu$, and write $\tilde{\nu}_n$ for the corresponding empirical measure. Using the definition of $E_\gamma$ in (12), we can compute
$$\mathbb{E}\, E_\gamma^2(\tilde{\nu}_n, \nu_n) = \mathbb{E}\Big[ \frac{2}{n^2} \sum_{i,j} \|X_i - \tilde{X}_j\|^\gamma - \frac{1}{n^2} \sum_{i,j} \|X_i - X_j\|^\gamma - \frac{1}{n^2} \sum_{i,j} \|\tilde{X}_i - \tilde{X}_j\|^\gamma \Big] = \frac{2}{n}\, \mathbb{E}\|X_1 - X_2\|^\gamma.$$
The conclusion then follows from taking the expectation of the expression
$$\mathbb{E}\big[ E_\gamma^2(\tilde{\nu}_n, \nu_n) \mid \nu_n \big] = E_\gamma^2(\nu, \nu_n) + \frac{1}{n}\, \mathbb{E}\|X_1 - X_2\|^\gamma$$
and the inequality $|x + y|^\gamma \leq 2^{\max\{0, \gamma - 1\}}(|x|^\gamma + |y|^\gamma)$ for all $x, y \in \mathbb{R}$.

As a clarifying remark: while IPM distances also appear in the formulation of GAN (2) with respect to the discriminator class $\mathcal{D}$, our distance $E_\gamma$ is not the IPM of $\mathcal{D}_\gamma$, but rather that of the RKHS ball of $k_\gamma$ defined above. For $\gamma = 1$, the IPM for $\mathcal{D}_\gamma$ returns $d_H$, which is lower bounded (up to constant, see Proposition 3) by $\bar{d}_H = \Theta(E_1)$. For $\gamma \neq 1$, we leave the study of the IPM for $\mathcal{D}_\gamma$, as well as the characterization of the RKHS ball of $k_\gamma$, out of scope. Despite $E_\gamma$ not being precisely the IPM of $\mathcal{D}_\gamma$, our next characterization relates $E_\gamma$ to $\mathcal{D}_\gamma$ in the same way $\bar{d}_H$ relates to $\mathcal{D}_1$, via a slicing perspective.

2.4 The Sliced Form

Another equivalent characterization of the generalized energy distance is in the form of a sliced distance.
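As a concrete illustration of the pairwise-distance (MMD) form and of the $1/n$ decay in Lemma 7, here is a minimal numerical sketch. The function names and the plug-in V-statistic are our own choices, not from the paper.

```python
import numpy as np

def energy_distance_sq(x, y, gamma=1.0):
    """Plug-in (V-statistic) estimate of E_gamma^2 between two samples:
    E_gamma^2 = 2 E||X-Y||^g - E||X-X'||^g - E||Y-Y'||^g,
    i.e. the squared MMD with kernel k_g(x,y) = ||x||^g + ||y||^g - ||x-y||^g."""
    def mean_dist(a, b):
        # mean of ||a_i - b_j||^gamma over all pairs (includes the zero diagonal)
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return (d ** gamma).mean()
    return 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def mean_e2_same_law(n, reps=30, d=1, seed=0):
    """Average E_1^2 between two independent empirical measures of N(0, I_d);
    by (the analogue of) Lemma 7 this should scale like 1/n."""
    rng = np.random.default_rng(seed)
    vals = [energy_distance_sq(rng.standard_normal((n, d)),
                               rng.standard_normal((n, d))) for _ in range(reps)]
    return float(np.mean(vals))
```

A handy sanity check: for two point masses at Euclidean distance $r$ the formula returns exactly $2r^\gamma$, since the two within-sample terms vanish.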
Sliced distances are calculated by first choosing a random direction on the unit sphere, and then computing a one-dimensional distance between the projections of the two input distributions onto the chosen direction. For $\gamma \in (0, 2)$ define the function
$$\psi_\gamma(x) = \begin{cases} |x|^{(\gamma-1)/2} & \text{for } \gamma \neq 1, \\ 1\{x \geq 0\} & \text{otherwise.} \end{cases} \qquad (17)$$
The following result, to the best of our knowledge, has not appeared in prior literature except for the case of $\gamma = 1$. This is also the direct generalization of Theorem 4.

Theorem 8. Let $\gamma \in (0, 2)$ and let $\mu, \nu$ be probability distributions on $\mathbb{R}^d$ with finite $\gamma$-th moment. Then for $(X, Y) \sim \mu \otimes \nu$ we have
$$E_\gamma^2(\mu, \nu) = \frac{1}{S_\gamma(d)} \int_{S^{d-1}} \int_{\mathbb{R}} \Big[ \mathbb{E}\,\psi_\gamma(\langle X, v \rangle - b) - \mathbb{E}\,\psi_\gamma(\langle Y, v \rangle - b) \Big]^2\, db\, d\sigma(v), \qquad (18)$$
where, for $\gamma \neq 1$,
$$S_\gamma(d) = \frac{4\sin^2\big(\tfrac{\pi(\gamma-1)}{4}\big)\, \Gamma\big(\tfrac{\gamma+1}{2}\big)^2\, \pi^{d/2-1}\, \Gamma\big(1 - \tfrac{\gamma}{2}\big)}{\gamma 2^{\gamma-1}\, \Gamma\big(\tfrac{d+\gamma}{2}\big)},$$
and $S_1(d) = \pi^{(d-1)/2}\, \Gamma\big(\tfrac{d+1}{2}\big)^{-1}$.

The proof of Theorem 8 hinges on computing the Fourier transform of the function $\psi_\gamma$, which can be interpreted as a tempered distribution. We point out a special property of the integral on the right-hand side of (18). After expanding the square, one finds that the individual terms in the sum are not absolutely integrable for $\gamma = 1$. However, due to cancellations within the squared quantity, the integral is finite. As claimed, using the language of Nadjahi et al. (2020), Theorem 8 allows us to interpret $E_\gamma$ as a sliced probability divergence. Given $v \in S^{d-1}$, write $\theta_v = \langle v, \cdot \rangle$, and $\theta_v \# \nu = \nu \circ \theta_v^{-1}$ for the pushforward of $\nu$ under $\theta_v$. We have
$$S_\gamma(d)\, E_\gamma^2(\mu, \nu) = S_\gamma(1) \int_{S^{d-1}} E_\gamma^2(\theta_v \# \mu,\, \theta_v \# \nu)\, d\sigma(v).$$
We may also observe that the energy distance $E_1$ is equal to the sliced Cramér-2 distance up to constant, which has been studied recently by both theoretical and empirical works (Knop et al., 2018; Kolouri et al., 2022).4

4. The Cramér-$p$ distance is simply the $L^p$ distance between cumulative distribution functions.

2.5 The Riesz Potential Form

The generalized energy distance can also be linked to the Riesz potential (Landkof, 1972, Chapter 1.1), which is the inverse of the fractional Laplace operator.
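Before moving on, the sliced form of Section 2.4 can be sketched numerically: average squared one-dimensional energy distances of random projections. This is our own illustrative code; the constants $S_\gamma(d)$, $S_\gamma(1)$ from Theorem 8 are deliberately omitted, so the result is only proportional to $E_\gamma^2$.

```python
import numpy as np

def e2_1d(x, y, gamma=1.0):
    # V-statistic for the squared one-dimensional energy distance
    m = lambda a, b: (np.abs(a[:, None] - b[None, :]) ** gamma).mean()
    return 2 * m(x, y) - m(x, x) - m(y, y)

def sliced_e2(x, y, gamma=1.0, n_dirs=200, seed=0):
    """Monte Carlo version of the sliced form: draw directions v uniformly on
    the sphere and average E_gamma^2 of the projections <X, v>, <Y, v>."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n_dirs, x.shape[1]))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # uniform directions on the sphere
    return float(np.mean([e2_1d(x @ w, y @ w, gamma) for w in v]))
```

The sliced quantity vanishes exactly on identical samples and stays bounded away from zero on well-separated ones, mirroring the behavior of $E_\gamma$ itself.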
Given $0 < s < d$, the Riesz potential $I_s f$ of a compactly supported signed measure $f$ on $\mathbb{R}^d$ is defined (in a weak sense) by $I_s f = f * K_s$, where $K_s(x) = c_s^{-1} \|x\|^{s-d}$ and $c_s = \pi^{d/2}\, 2^s\, \Gamma(s/2)\, \Gamma((d-s)/2)^{-1}$. The Fourier transform of the Riesz kernel is given by $\hat{K}_s(\omega) = \|\omega\|^{-s}$, interpreted as a tempered distribution. The following proposition is derived by setting $s = \frac{d+\gamma}{2}$ and using the Fourier form (Proposition 5) of the energy distance.

Proposition 9. Let $\gamma \in (0, \min\{d, 2\})$ and let $\mu, \nu$ be compactly supported probability distributions on $\mathbb{R}^d$. Then
$$E_\gamma(\mu, \nu) = (2\pi)^{d/2} \sqrt{F_\gamma(d)}\, \big\| I_{\frac{d+\gamma}{2}}(\mu - \nu) \big\|_2. \qquad (19)$$

3. Main Comparison: TV Versus Energy

After considering the connection of the perceptron discrepancy $d_H$ to the energy distance in Section 2, we turn to some of our main technical results, which provide novel quantitative comparisons between $\{E_\gamma\}_{\gamma \in (0,2)}$ and the total variation distance. In Section 3.1 we show that the generalized energy distance is upper bounded by total variation for compactly supported distributions. In Section 3.2 we derive lower bounds on the generalized energy distance in terms of the total variation distance over the two distribution classes that we have introduced, namely smooth distributions and Gaussian mixtures. Finally, in Section 3.3 we turn to the case of discrete distributions, which requires alternative techniques.

3.1 Upper Bound: Arbitrary Compactly Supported Distributions

Note that we (obviously) always have $d_H(\mu, \nu) \leq \mathrm{TV}(\mu, \nu)$ for arbitrary probability measures $\mu$ and $\nu$. Moreover, for distributions supported on a unit ball we also have $\bar{d}_H(\mu, \nu) \lesssim d_H(\mu, \nu)$. Therefore, by the identification of $\bar{d}_H$ and $E_1$ (Theorem 4), we can see that for distributions with bounded support we always have $E_1(\mu, \nu) \lesssim \mathrm{TV}(\mu, \nu)$. The next result generalizes this estimate to all $E_\gamma$, not just $\gamma = 1$.

Proposition 10.
For any dimension $d \geq 1$ and $\gamma \in (0, 2)$ there exists a finite constant $c = c(d, \gamma)$ such that for any two probability distributions $\mu, \nu$ supported on the unit ball we have $E_\gamma(\mu, \nu) \leq c\, \mathrm{TV}(\mu, \nu)$.

Proof The proof directly follows from $\|x - y\|^\gamma \leq 2^\gamma$ and so
$$E_\gamma^2(\mu, \nu) = \mathbb{E}\big[ 2\|X - Y\|^\gamma - \|X - X'\|^\gamma - \|Y - Y'\|^\gamma \big] = -\int \|x - y\|^\gamma\, (d\mu(x) - d\nu(x))(d\mu(y) - d\nu(y)) \leq 2^\gamma\, \mathrm{TV}(\mu, \nu)^2,$$
where $(X, X', Y, Y') \sim \mu^{\otimes 2} \otimes \nu^{\otimes 2}$.

3.2 Lower Bound: Smooth Distributions and Gaussian Mixtures

In Section 3.1 we showed that the energy distance is upper bounded by total variation for compactly supported measures. In this section we look at the reverse direction, namely, we aim to lower bound the energy distance by total variation.

Theorem 11. For any $\beta > 0$ and $d \geq 1$, there exists a finite constant $C_1 = C_1(\beta, d)$ so that
$$\sqrt{\gamma(2-\gamma)}\; \frac{\mathrm{TV}(\mu, \nu)^{\frac{2\beta+d+\gamma}{2\beta}}}{C_1\, (\|\mu\|_{\beta,2}^2 + \|\nu\|_{\beta,2}^2)^{\frac{d+\gamma}{4\beta}}} \leq E_\gamma(\mu, \nu) \qquad (20)$$
for any $\mu, \nu \in \mathcal{P}_S(\beta, d, \infty)$ and $\gamma \in (0, 2)$. Similarly, for any $d \geq 1$ there exists a finite constant $C_2 = C_2(d)$ such that
$$\sqrt{\gamma(2-\gamma)}\; \frac{\mathrm{TV}(\mu, \nu)}{\log(3 + 1/\mathrm{TV}(\mu, \nu))^{\frac{2d+\gamma}{4}}} \leq C_2\, E_\gamma(\mu, \nu) \qquad (21)$$
for every $\mu, \nu \in \mathcal{P}_G(d)$ and $\gamma \in (0, 2)$.

Proof Abusing notation, identify $\mu$ and $\nu$ with their Lebesgue densities. The argument proceeds through a chain of inequalities:

1. Bound TV by the $L^2$ distance between densities.
2. Apply Parseval's theorem to pass to Fourier space.
3. Apply Hölder's inequality with well-chosen exponents.

Proof of (20). Jensen's inequality implies that
$$2\,\mathrm{TV}(\mu, \nu) = \|\mu - \nu\|_1 \leq \sqrt{\mathrm{vol}(B(0, 1))}\, \|\mu - \nu\|_2 \lesssim \|\mu - \nu\|_2,$$
where vol denotes volume and we discard dimension-dependent constants. This completes the first step of our proof. For the second step, note that $\mu, \nu \in L^2(\mathbb{R}^d)$ and we may apply Parseval's theorem to obtain $\|\mu - \nu\|_2^2 = \frac{1}{(2\pi)^d} \|\hat{\mu} - \hat{\nu}\|_2^2$. For arbitrary $\varphi > 0$ and $r \in [1, \infty]$, Hölder's inequality with exponents $\frac{1}{r} + \frac{1}{r'} = 1$ implies that
$$\|\hat{\mu} - \hat{\nu}\|_2^2 = \int_{\mathbb{R}^d} |\hat{\mu}(\omega) - \hat{\nu}(\omega)|^2\, d\omega \leq \left( \int_{\mathbb{R}^d} |\hat{\mu}(\omega) - \hat{\nu}(\omega)|^2\, \|\omega\|^{\varphi r'}\, d\omega \right)^{1/r'} \left( \int_{\mathbb{R}^d} \frac{|\hat{\mu}(\omega) - \hat{\nu}(\omega)|^2}{\|\omega\|^{\varphi r}}\, d\omega \right)^{1/r}. \qquad (22)$$
Now, we choose φ and r to satisfy φr′ = 2β and φr = d + γ.
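The exponent bookkeeping in this step can be double-checked mechanically. The snippet below (our own sanity check, not the paper's) verifies that $r = \frac{2\beta+d+\gamma}{2\beta}$ and $\varphi = \frac{2\beta(d+\gamma)}{2\beta+d+\gamma}$, the solution quoted in the proof, satisfy both constraints together with the Hölder conjugacy relation:

```python
def holder_exponents(beta, d, gamma):
    """Candidate solution of phi * r' = 2*beta and phi * r = d + gamma,
    where 1/r + 1/r' = 1 (Hoelder conjugates)."""
    r = (2 * beta + d + gamma) / (2 * beta)
    phi = 2 * beta * (d + gamma) / (2 * beta + d + gamma)
    r_conj = r / (r - 1)  # conjugate exponent: 1/r + 1/r_conj = 1
    return r, r_conj, phi
```

Plugging in a few parameter choices confirms that both constraints hold exactly and that $r \geq 1$, $\varphi > 0$, as asserted in the proof.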
The first equation ensures that the first integral term is bounded by µ ν 2/r β,2, which is assumed to be at most a d, β dependent constant. The second equation ensures that the second integral term is GERBER, JIANG, POLYANSKIY, AND SUN equal to (Eγ(µ, ν)2/Fγ(d))1/r by Proposition 5. The solution to this system of equations is given by r = (2β + d + γ)/(2β) and ϕ = 2β d+γ 2β+d+γ . Note that clearly ϕ > 0 and r 1. Thus, after rearrangement and using that Fd(γ) = Θ(γ(2 γ)) up to a dimension dependent constant, we obtain γ(2 γ) bµ bν 2β 2 C1 ( µ 2 β,2 + ν 2 β,2) 4β Eγ(µ, ν), for a finite constant C1 = C1(d, β), concluding the proof. Proof of (21). We write C(d) (0, ) for a dimension dependent constant that may change from line to line. The outline of the argument is analogous to the above, with the additional step of having to bound the (β, 2)-Sobolev norm of the Gaussian density as β for which we rely on Lemma 30. Let µ and ν have densities p φ and q φ, where φ is the density of N(0, Id). Writing f = (p q), we can extend the proof of (Jia et al., 2023, Theorem 22) to multiple dimensions to find, for any R > 2, that 2TV(µ, ν) = µ ν 1 = Z x R |(f φ)(x)| dx + Z Rd φ(x y)df(y) dx vold(B(0, R)) x R |(f φ)(x)|2dx + Z x >R exp( x 2/8)dx C(d) Rd/2 µ ν 2 + exp( Ω(R2)) , where the second line uses that supp(f) B(0, 1). Taking R p log(3 + 1/ µ ν 2) we obtain the inequality TV(µ, ν) C(d) µ ν 2 log(3 + 1/ µ ν 2)d/4. (23) By H older s inequality we obtain bf 2 ω β bf(ω) d+γ 2β+d+γ 2 = ω β bf(ω) d+γ 2β+d+γ 2 Eγ(µ, ν) 2β 2β+d+γ Fγ(d) β 2β+d+γ by Proposition 5. Using that | bf| |bφ| and applying Lemma 30, for β 1 we get β 2β+d+γ bf 2 Eγ(µ, ν) 2β 2β+d+γ 5πd/2 2 ! d+γ 2(2β+d+γ) . Rearranging and using Parseval s Theorem, we get Eγ(µ, ν) C(d) p γ(2 γ) f 2 f 2e (d+γ)(2β+d 1) DENSITY ESTIMATION USING THE PERCEPTRON for some d-dependent, albeit exponential, constant C(d) > 0. 
Plugging in $\beta = \log(3 + 1/\|f\|_2)$ and assuming that $\|f\|_2$ is small enough in terms of $d$, we obtain
$$E_\gamma(\mu, \nu) \geq C(d)\, \sqrt{\gamma(2-\gamma)}\; \frac{\|f\|_2}{\log(3 + 1/\|f\|_2)^{\frac{d+\gamma}{4}}} \geq C(d)\, \sqrt{\gamma(2-\gamma)}\; \frac{\mathrm{TV}(\mu, \nu)}{\log(3 + 1/\mathrm{TV}(\mu, \nu))^{\frac{2d+\gamma}{4}}},$$
where the second inequality uses (23) and Lemma 26.

Theorem 11 is our main technical result, which shows that $E_\gamma$ is lower bounded by a polynomial of the total variation distance for both the smooth distribution class $\mathcal{P}_S$ and Gaussian mixtures $\mathcal{P}_G$. Note also that in one dimension, (20) follows from the Gagliardo-Nirenberg-Sobolev interpolation inequality. However, to our knowledge, the inequality is new for $d > 1$. As for the tightness of Theorem 11, we manage to prove that this inequality is the best possible for $\mathcal{P}_S$ in one dimension, and best possible up to a poly-logarithmic factor in dimension 2 and above.

Proposition 12. For any $\beta > 0$, $d \geq 1$, $\gamma \in (0, 2)$ and $C > 0$ satisfying Assumption 1, there exists a finite constant $C_1 = C_1(\beta, d, \gamma, C)$ so that for any value of $\epsilon \in (0, 1)$, there exist $\mu_\epsilon, \nu_\epsilon \in \mathcal{P}_S(\beta, d, C)$ such that $\mathrm{TV}(\mu_\epsilon, \nu_\epsilon)/\epsilon \in (1/C_1, C_1)$ and
$$E_\gamma(\mu_\epsilon, \nu_\epsilon) \leq C_1\, \mathrm{TV}(\mu_\epsilon, \nu_\epsilon)^{\frac{2\beta+d+\gamma}{2\beta}}\, \log\Big(3 + \frac{1}{\mathrm{TV}(\mu_\epsilon, \nu_\epsilon)}\Big).$$

In the special case $\gamma = 1$ we obtain an even stronger notion of tightness.

Proposition 13. When $\gamma = 1$ we may replace $E_1$ by $d_H$ in Proposition 12.

Proposition 13 is an improvement over Proposition 12 due to the inequality $\bar{d}_H \lesssim d_H$ over the class $\mathcal{P}_S(\beta, d, C)$, which follows from Proposition 3. It shows also that our construction has the property that there does not exist any halfspace that separates $\mu$ and $\nu$ better than our bounds suggest. The proofs of both results are presented in Section E. The general idea is to saturate Hölder's inequality in (22), for which the Fourier transform of $f = \frac{d(\mu - \nu)}{dx}$ should be supported on a sphere. However, such an $f$ clearly cannot be compactly supported. Thus the actual construction is to multiply the Fourier inverse of the uniform measure on a sphere with a compactly supported mollifier.
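As a small numerical companion to these TV-versus-energy comparisons (our own illustration, not from the paper): in one dimension $E_1^2(\mu, \nu) = 2\int (F_\mu - F_\nu)^2\, dx$, so for two unit-variance Gaussians both $E_1$ and TV are easy to evaluate side by side.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def energy_sq_1d_gauss(shift, lo=-10.0, hi=11.0, n=200_000):
    """E_1^2 between N(0,1) and N(shift,1) via E_1^2 = 2 * int (F - G)^2 dx
    (midpoint rule)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        diff = normal_cdf(x) - normal_cdf(x, mu=shift)
        total += diff * diff * h
    return 2.0 * total

def tv_gauss(shift):
    # TV(N(0,1), N(s,1)) = 2*Phi(|s|/2) - 1
    return 2.0 * normal_cdf(abs(shift) / 2.0) - 1.0
```

For a unit shift this gives $E_1^2 \approx 0.542$ against $\mathrm{TV} \approx 0.383$, consistent with $E_1 \lesssim \mathrm{TV}$ for compactly bounded discrepancies of this size.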
In $d > 1$ the mollifier that we require must have super-polynomial Fourier spectrum decay, for which we use the recent construction in Cohen (2023).

3.3 Lower Bound: Discrete Distributions

Suppose we have two discrete distributions that are supported on a common, finite set of size $k$. One way to measure the energy distance between them would be to identify their support with the set $\{1, 2, \ldots, k\}$, thereby embedding the two distributions in $\mathbb{R}$, and applying the one-dimensional energy distance. While the above approach seems reasonable, it is entirely arbitrary. Indeed, there might not be a natural ordering of the support; moreover, why should one choose the integers between 1 and $k$ instead of, say, the set $\{1, 2, 4, \ldots, 2^k\}$? The total variation distance does not suffer from such ambiguities, and it is unclear how our choice of embedding affects the relationship to TV. The following result attacks precisely this question.

Theorem 14. Let $\mu$ and $\nu$ be probability distributions supported on the set $\{x_1, \ldots, x_k\} \subseteq \mathbb{R}^d$ and let $\delta = \min_{i \neq j} \|x_i - x_j\|$. Then there exists a universal constant $C > 0$ such that
$$E_1^2(\mu, \nu) \geq \frac{C\delta}{\sqrt{d}\, k}\, \mathrm{TV}^2(\mu, \nu).$$

Proof Let $\mu = \sum_{i=1}^k \mu_i \delta_{x_i}$ and $\nu = \sum_{i=1}^k \nu_i \delta_{x_i}$. Then, by (Ball, 1992, Theorem 1) we have
$$E_1^2(\mu, \nu) = -\sum_{i,j} (\mu_i - \nu_i)(\mu_j - \nu_j)\, \|x_i - x_j\| \geq \frac{C\delta}{\sqrt{d}} \sum_{i=1}^k (\mu_i - \nu_i)^2 \geq \frac{C\delta}{\sqrt{d}\, k}\, \mathrm{TV}^2(\mu, \nu).$$

Remark 15. Similar results can be proved for the generalized energy distance $E_\gamma$, using e.g. the work of Narcowich and Ward (1992). However, to the best of our knowledge, these estimates degrade significantly in the dimension $d$, in contrast with Ball (1992).

Notice that, by our discussion above, the support set $\{x_1, \ldots, x_k\}$ in Theorem 14 is arbitrary and may be chosen by us. Since the scale of the supporting points $x_1, \ldots, x_k$ is statistically irrelevant, we remove this ambiguity by restricting the points to lie in the unit ball, i.e. requiring that $\max_i \|x_i\| \leq 1$. We see now that the comparison between $E_1$ and TV improves as $\delta/\sqrt{d}$ grows.
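The quadratic form behind Theorem 14's proof, $E_1^2(\mu, \nu) = -\sum_{i,j}(\mu_i - \nu_i)(\mu_j - \nu_j)\|x_i - x_j\|$, is easy to evaluate directly. Below is our own sketch; for the one-hot embedding all support points are $\sqrt{2}$ apart, so the form collapses to the closed form $E_1^2 = \sqrt{2}\,\|\mu - \nu\|_2^2$.

```python
import numpy as np

def discrete_energy_sq(points, mu, nu, gamma=1.0):
    """E_gamma^2 between two pmfs mu, nu on common embedded support points:
    E_gamma^2 = - sum_{i,j} t_i t_j ||x_i - x_j||^gamma with t = mu - nu."""
    t = np.asarray(mu) - np.asarray(nu)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1) ** gamma
    return float(-t @ dist @ t)

# one-hot embedding of an alphabet of size k: all pairwise distances equal sqrt(2)
k = 4
points = np.eye(k)
mu = np.array([0.7, 0.1, 0.1, 0.1])
nu = np.array([0.1, 0.1, 0.1, 0.7])
e2 = discrete_energy_sq(points, mu, nu)  # equals sqrt(2) * ||mu - nu||_2^2
tv = 0.5 * np.abs(mu - nu).sum()
```

Since all one-hot points are equidistant, the off-diagonal sum telescopes against $(\sum_i t_i)^2 = 0$, which is what produces the $\sqrt{2}\,\|t\|_2^2$ identity used in the check below.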
Given a fixed value of $\delta$, we want to make the dimension $d$ of our embedding as low as possible, which means that the points $x_1, \ldots, x_k$ should form a large $\delta$-packing of the $d$-dimensional unit ball. Due to well-known bounds on the packing number of the Euclidean ball, it follows that the best one can hope for is $\log(k) \asymp d \log(1/\delta)$. Maximizing $\delta/\sqrt{d}$ subject to this constraint yields the choice $d = \Theta(\log(k))$ and $\delta = \Theta(1)$. This gives us the following corollary.

Corollary 16. There exists a universal constant $C \in (0, \infty)$ such that for any $k \geq 1$ there exists a set of points $x_1, \ldots, x_k \in \mathbb{R}^{C\log(k)}$ with $\max_i \|x_i\| \leq 1$ such that for any two probability mass functions $\mu = (\mu_1, \ldots, \mu_k)$ and $\nu = (\nu_1, \ldots, \nu_k)$.

The question arises how the set of points $x_1, \ldots, x_k$ in Corollary 16 should be constructed. One solution is to use an error-correcting code (ECC), whereby we take the $x_i$ to be the codewords of an ECC on the scaled hypercube $\frac{1}{\sqrt{d}}\{\pm 1\}^d$ for some dimension $d$ (known as blocklength in this context). An ECC is asymptotically good if the message length $\log(k)$ is linear in the blocklength $d$, that is $d \asymp \log(k)$, and if the minimum Hamming distance between any two codewords is $\Theta(d)$, which translates precisely into $\delta = \min_{i \neq j} \|x_i - x_j\| \asymp 1$. Many explicit constructions of asymptotically good error-correcting codes exist, see Justesen (1972) for one such example, and random codes are almost surely good (Barg and Forney, 2002). Clearly, the better the code is, the better the constants we obtain in Corollary 16.

Remark 17. One interesting consequence of Corollary 16 and the preceding discussion is the following: given a categorical feature with $k$ possible values, the perceptron may obtain better performance by identifying each category with the codewords $x_1, \ldots, x_k$ of an ECC instead of the standard one-hot encoding.

4.
Density Estimation

In this section we apply what we have learned about the generalized energy distance and the perceptron discrepancy in prior sections, and analyze multiple problems related to density estimation.

4.1 Estimating Smooth Distributions and Gaussian Mixtures

Suppose that $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \nu$ for some probability distribution $\nu$ on $\mathbb{R}^d$. Given a class of generator distributions $G$ and $\gamma \in (0, 2)$, define the minimum-$E_\gamma$ estimator as
$$\hat{\nu}_\gamma \in \arg\min_{\nu' \in G} E_\gamma(\nu', \nu_n), \qquad (25)$$
where $\nu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$. Note that $\hat{\nu}_\gamma$ does not quite agree with our definition of $\hat{\nu}$ in (2), because the $\gamma = 1$ case minimizes the average halfspace distance $\bar{d}_H \asymp E_1$ and not the perceptron discrepancy $d_H$. The following two results bound the performance of $\hat{\nu}_\gamma$ as defined in (25), as an estimator of $\nu$, for the smooth density class $\mathcal{P}_S$ as well as the Gaussian mixture class $\mathcal{P}_G$. In Section 4.2 we present the adaptation of these results to $d_H$, thereby proving Theorem 1.

Theorem 18. Let $\hat{\nu}_\gamma$ be the estimator defined in (25). For any $\beta > 0$, $d \geq 1$ and $C > 0$, there exists a finite constant $C_1 = C_1(\beta, d, C)$ so that
$$\sup_{\nu \in \mathcal{P}_S(\beta, d, C)} \mathbb{E}\,\mathrm{TV}(\hat{\nu}_\gamma, \nu) \leq C_1\, (n\gamma(2-\gamma))^{-\frac{\beta}{2\beta+d+\gamma}} \qquad (26)$$
holds for $G = \mathcal{P}_S(\beta, d, C)$ and any $\gamma \in (0, 2)$. Similarly, for any $d \geq 1$ there is a finite constant $C_2 = C_2(d)$ such that
$$\sup_{\nu \in \mathcal{P}_G(d)} \mathbb{E}\,\mathrm{TV}(\hat{\nu}_\gamma, \nu) \leq C_2\, \varphi\big( (n\gamma(2-\gamma))^{-1/2} \big) \qquad (27)$$
holds for $G = \mathcal{P}_G(d)$ and any $\gamma \in (0, 2)$, where $\varphi(x) = x \log(3 + 1/x)^{\frac{2d+\gamma}{4}}$.

Proof Let us focus on the case $G = \mathcal{P}_S(\beta, d, C)$ first and let $t = \frac{2\beta+d+\gamma}{2\beta}$. The inequality $E_\gamma(\hat{\nu}_\gamma, \nu_n) \leq E_\gamma(\nu, \nu_n)$ holds almost surely by the definition of $\hat{\nu}_\gamma$. Writing $C_1 = C_1(\beta, d, C)$ for a finite constant that we relabel freely, the first claim is substantiated by the chain of inequalities
$$\mathbb{E}\,\mathrm{TV}(\hat{\nu}_\gamma, \nu) \overset{\text{Thm. 11}}{\leq} \mathbb{E}\Big[ C_1 \Big( \tfrac{E_\gamma(\hat{\nu}_\gamma, \nu)}{\sqrt{\gamma(2-\gamma)}} \Big)^{1/t} \Big] \leq \mathbb{E}\Big[ C_1 \Big( \tfrac{E_\gamma(\nu, \nu_n) + E_\gamma(\hat{\nu}_\gamma, \nu_n)}{\sqrt{\gamma(2-\gamma)}} \Big)^{1/t} \Big] \overset{\text{Eq. (25)}}{\leq} \mathbb{E}\Big[ C_1 \Big( \tfrac{2 E_\gamma(\nu, \nu_n)}{\sqrt{\gamma(2-\gamma)}} \Big)^{1/t} \Big] \overset{\text{Jensen's}}{\leq} C_1 \Big( \tfrac{2\,\mathbb{E} E_\gamma(\nu, \nu_n)}{\sqrt{\gamma(2-\gamma)}} \Big)^{1/t} \overset{\text{Lem. 7}}{\leq} C_1\, (n\gamma(2-\gamma))^{-\frac{\beta}{2\beta+d+\gamma}}.$$
The result for $G = \mathcal{P}_G$ follows analogously. Define $r(x) = x \log(3 + 1/x)^{-t}$ where $t = \frac{2d+\gamma}{4} > 0$. By direct calculation, one can check that r is strictly increasing and convex on R+.
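The claim that $r(x) = x \log(3 + 1/x)^{-t}$ is increasing and convex is indeed a direct calculation; here is a quick finite-difference check (our own, for the representative choice $d = 1$, $\gamma = 1$, i.e. $t = 3/4$):

```python
import math

def r(x, t=0.75):
    # r(x) = x * log(3 + 1/x)^(-t), the rate function used for the P_G case
    return x * math.log(3.0 + 1.0 / x) ** (-t)

# sample r on a grid; positive first differences certify monotonicity,
# positive second differences certify (discrete) convexity
h = 0.005
xs = [0.05 + i * h for i in range(400)]  # grid on [0.05, 2.05]
vals = [r(x) for x in xs]
first = [b - a for a, b in zip(vals, vals[1:])]
second = [c - 2 * b + a for a, b, c in zip(vals, vals[1:], vals[2:])]
```

This is only a spot check on a finite grid, of course, not a proof; the paper's claim covers all of $\mathbb{R}_+$.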
As a consequence, its inverse $r^{-1}$ is strictly increasing and concave. Let $C_2$ be a $d$-dependent finite constant which we relabel repeatedly. Similarly to the case of smooth distributions covered above, we obtain the chain of inequalities
$$\mathbb{E}\,\mathrm{TV}(\hat{\nu}_\gamma, \nu) \overset{\text{Thm. 11}}{\leq} \mathbb{E}\Big[ r^{-1}\Big( \tfrac{C_2\, E_\gamma(\hat{\nu}_\gamma, \nu)}{\sqrt{\gamma(2-\gamma)}} \Big) \Big] \overset{\text{Jensen's}}{\leq} r^{-1}\Big( \tfrac{C_2\, \mathbb{E} E_\gamma(\hat{\nu}_\gamma, \nu)}{\sqrt{\gamma(2-\gamma)}} \Big) \overset{\text{Eq. (25)}}{\leq} r^{-1}\Big( \tfrac{2C_2\, \mathbb{E} E_\gamma(\nu, \nu_n)}{\sqrt{\gamma(2-\gamma)}} \Big) \overset{\text{Lem. 7}}{\leq} r^{-1}\big( C_2\, (n\gamma(2-\gamma))^{-1/2} \big).$$
The conclusion follows by Lemma 26.

Notice that the rate of estimation of the minimum-$E_\gamma$ density estimator improves as $\gamma \to 0$, and in fact seems to approach the optimum. However, simultaneously, the effective sample size $n\gamma$ shrinks. The best trade-off that we can derive by adaptively setting $\gamma$ to be dependent on $n$ is the following.

Corollary 19. For any $\beta > 0$, $d \geq 1$ and $C > 0$, there exists a finite constant $C_1 = C_1(\beta, d, C)$ such that: for all $n \geq 2$, with $\hat{\nu}_{\gamma_n} \in \arg\min_{\nu' \in G} E_{\gamma_n}(\nu', \nu_n)$ in the setting of (25) where $\gamma_n = \log(n)^{-1}$ and $G = \mathcal{P}_S(\beta, d, C)$, one has:
$$\sup_{\nu \in \mathcal{P}_S(\beta, d, C)} \mathbb{E}\,\mathrm{TV}(\hat{\nu}_{\gamma_n}, \nu) \leq C_1 \Big( \frac{\log n}{n} \Big)^{\frac{\beta}{2\beta+d}}. \qquad (28)$$
Similarly, for any $d \geq 1$ there is a finite constant $C_2 = C_2(d)$ such that for all $n \geq 3$, with $\hat{\nu}_{\gamma_n} \in \arg\min_{\nu' \in G} E_{\gamma_n}(\nu', \nu_n)$ where $\gamma_n = \log\log(2n)^{-1}$ and $G = \mathcal{P}_G(d)$, one has:
$$\sup_{\nu \in \mathcal{P}_G(d)} \mathbb{E}\,\mathrm{TV}(\hat{\nu}_{\gamma_n}, \nu) \leq C_2\, \log(n)^{d/2} \sqrt{\log\log n}\, /\sqrt{n}. \qquad (29)$$

4.2 Proof of Theorem 1 and Theorem 2

We already have everything needed to deduce Theorem 2. Since it is an exercise in combining results, we simply list the required steps:

1. Use Theorem 11 to get a comparison between TV and $E_\gamma$.
2. Set $\gamma = 1$ and use Theorem 4 to get the equivalence between $E_1$ and $\bar{d}_H$.
3. Use Proposition 3 to get a comparison between $\bar{d}_H$ and $d_H$.

Turning to the proof of Theorem 1, we find that it is completely analogous to the proof of Theorem 18, with the only difference being that we can no longer rely on Lemma 7 to show that the distance between empirical and population measures decays at the parametric rate, as the latter applies to $E_\gamma$ instead of $d_H$. However, the corresponding result for $d_H$ is well known.

Lemma 20.
Let $\nu$ be a probability distribution on $\mathbb{R}^d$ and $\nu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ for i.i.d. observations $X_i \sim \nu$. Then, for a finite universal constant $C$,
$$\mathbb{E}\, d_H(\nu, \nu_n) \leq C\sqrt{\frac{d}{n}}.$$

Proof Follows for example from (Vershynin, 2018, 8.3.23) and the fact that $\mathcal{D}_1$, the family of halfspace indicators, has VC dimension $d + 1$.

With Lemma 20 in hand, completing the argument is straightforward: to deduce Theorem 1, follow the same steps as in the proof of Theorem 18, except use Theorem 2 and Lemma 20 in place of Theorem 11 and Lemma 7, respectively.

4.3 Estimating Discrete Distributions

In many practical machine learning tasks the data is discrete, albeit on a large alphabet $[k] = \{1, 2, \ldots, k\}$: for example, in recommender systems the alphabet could be all possible ads, products or articles. A common idea to apply modern learning pipelines to such data is to use an embedding $E : [k] \to \mathbb{R}^d$, with one-hot encoding ($d = k$) being the most popular choice. After such an embedding, the data is effectively made continuous and the density estimation methods as discussed previously can be applied. Can such an approach be good in the sense of minimax estimation guarantees? We answer this question positively in this section, provided that the embedding $E$ comes from an error-correcting code.

Let $P_k$ denote the set of all probability distributions on the set $[k]$. Suppose we observe an i.i.d. sample $X_1, \ldots, X_n \sim \nu$ from some unknown distribution $\nu \in P_k$. The problem of estimating $\nu$ is effectively trivial: the empirical distribution provides a minimax optimal estimator. Indeed, it is a folklore fact, see for example (Canonne, 2020, Theorem 1) or (Polyanskiy and Wu, 2023+, Exc. VI.8), that the optimal rate of estimation is given by
$$\inf_{\hat{\nu}} \sup_{\nu \in P_k} \mathbb{E}\,\mathrm{TV}^2(\hat{\nu}, \nu) \asymp \min\Big\{ \frac{k}{n}, 1 \Big\}. \qquad (30)$$
Recall from Section 3.3 that we may choose to embed the alphabet $[k]$ into some higher-dimensional Euclidean space. Given distinct points x1, . . .
, $x_k \in \mathbb{R}^d$ for some $d \geq 1$, we can identify any distribution $\mu \in P_k$ with the probability distribution $\sum_{i=1}^k \mu_i \delta_{x_i}$, where $\mu_i$ is the mass that $\mu$ puts on $i \in [k]$.

Theorem 21. There exists a universal constant $C < \infty$ with the following property. For any alphabet size $k$ there exist embedding points $a_1, \ldots, a_k \in \mathbb{R}^{C\log(k)}$ such that, given an i.i.d. sample $X_1, \ldots, X_n \sim \nu$ from an unknown $\nu \in P_k$, any estimator $\hat{\nu} \in P_k$ that satisfies
$$E_1\Big( \sum_{i=1}^k \hat{\nu}_i \delta_{a_i},\; \sum_{i=1}^k \nu_{n,i} \delta_{a_i} \Big) \leq \frac{c}{\sqrt{n}} \qquad (31)$$
for any $c$ enjoys the performance guarantee
$$\sup_{\nu \in P_k} \mathbb{E}\,\mathrm{TV}^2(\hat{\nu}, \nu) \leq C \min\Big\{ \frac{k\sqrt{\log(k)}\, (c+1)^2}{n},\; 1 \Big\}. \qquad (32)$$
Moreover, we may replace $E_1$ by $d_H$ in (31) and the result (32) remains true with $\sqrt{\log(k)}$ replaced by $\log(k)$.

Proof Let $a_1, \ldots, a_k \in \mathbb{R}^d$ be the points defined in Corollary 16 (relabeled from $x_1, \ldots, x_k$ for clarity) so that $d \asymp \log(k)$. By the triangle inequality and Lemma 7 (using $\max_i \|a_i\| \leq 1$) we have
$$\mathbb{E}\, E_1\Big( \sum_{i=1}^k \hat{\nu}_i \delta_{a_i},\; \sum_{i=1}^k \nu_i \delta_{a_i} \Big) \leq \frac{c + \sqrt{2\max_i \|a_i\|}}{\sqrt{n}} \leq \frac{c+2}{\sqrt{n}}.$$
By Corollary 16, the definition of $\hat{\nu}$ and the triangle inequality it follows that
$$\mathbb{E}\,\mathrm{TV}^2(\hat{\nu}, \nu) \lesssim \frac{k\sqrt{\log(k)}\, (c+1)^2}{n}.$$
Noting the trivial fact that $\mathrm{TV} \leq 1$ completes the proof of the first claim. Suppose now that we replace $E_1$ by $d_H$ in the definition of $\hat{\nu}$. The proof follows analogously, using the chain $\mathrm{TV}^2 \lesssim k\sqrt{\log(k)}\, E_1^2$ (Cor. 16), $E_1 \asymp \bar{d}_H$ (Thm. 4, with explicit constant $\pi^{(d-1)/4}\Gamma(\frac{d+1}{2})^{-1/2}$) and $\bar{d}_H \lesssim d_H$ (since $\max_i \|a_i\| \leq 1$), together with Lemma 20 in place of Lemma 7, which is where we lose the $\log(k)$ factor.

As we explained, the empirical distribution $\nu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ achieves optimality in (30), and clearly also achieves $c = 0$ in (31), i.e. minimizes the empirical risk globally. The point of Theorem 21 is to show that approximate minimizers, such as those found via SGD, are also nearly minimax optimal.

4.3.1 ESTIMATING HÖLDER SMOOTH DENSITIES

Theorem 21 has interesting implications for density estimation over the class of distributions on the cube $[0, 1]^d$ with uniformly bounded derivatives up to order $\lfloor\beta\rfloor \geq 1$ and $(\beta - \lfloor\beta\rfloor)$-Hölder continuous $\lfloor\beta\rfloor$-th derivative; call such distributions simply $\beta$-Hölder smooth.5 Writing $B_j$ for the cube with center $(j - \frac{1}{2})\epsilon^{1/\beta}$ and sidelength $\epsilon^{1/\beta}$ where j ∈ {1, …
, ε^{-1/β}}^d, it is known that
$$\sum_j \Big| \int_{B_j} (f(x) - g(x))\, dx \Big| \geq c \int_{[0,1]^d} |f(x) - g(x)|\, dx - O(\epsilon)$$
for any $\beta$-Hölder smooth densities $f, g$ and an $\epsilon$-independent constant $c > 0$, see for example (Arias-Castro et al., 2018, Lemma 7.2) or (Ingster and Suslina, 2003, Proposition 2.16). In other words, discretizing such distributions using a regular grid with $\Omega(\epsilon^{-d/\beta})$ cells maintains total-variation distances up to an additive $O(\epsilon)$ error. Now, consider a multilayer perceptron, that is, a fully connected multilayer neural network with activations given by $x \mapsto 1\{x \geq 0\}$. Such a multilayer network with large enough hidden layers can in principle implement the discretization described above, and embed the $\epsilon^{-d/\beta}$ cells as an error-correcting code. Thus, due to Theorem 21, the ERM density estimator (2) would achieve the minimax optimal density estimation rate $n^{-\beta/(2\beta+d)}$ over $\beta$-Hölder smooth densities, up to polylog factors, provided the discriminator class $\mathcal{D}$ includes the aforementioned multilayer perceptron and has VC dimension at most polylog in $1/\epsilon$. This observation essentially generalizes Theorem 1, which shows that if the discriminator class includes only the single-layer (with one neuron) perceptron then the best possible rate is $n^{-\beta/(2\beta+d+1)}$.

4.4 A Stopping Criterion for Smooth Density Estimation

As a corollary to our results, we propose a stopping criterion for training density estimators. Before doing so, let us record a result about the concentration properties of the empirical energy distance about its expectation.

Lemma 22. Let $\nu$ be supported on a compact subset $\Omega \subseteq \mathbb{R}^d$, and let $\nu_n$ be its empirical measure based on $n$ i.i.d. observations. For every $\gamma \in (0, 2)$ there exists a constant $C_1 = C_1(\Omega, \gamma) > 0$ such that
$$\mathbb{P}\Big( E_\gamma(\nu, \nu_n) \geq \frac{C_1}{\sqrt{n}} + t \Big) \leq 2\exp\Big( -\frac{nt^2}{C_1} \Big).$$
In other words, $E_\gamma(\nu, \nu_n)$ is $O(1/n)$-sub-Gaussian.

Proof Recall the MMD formulation of the generalized energy distance from Section 2.3. The corresponding kernel is given by $k_\gamma(x, y) = \|x\|^\gamma + \|y\|^\gamma - \|x - y\|^\gamma$. Clearly
$$\sup_{x, x', y, y' \in \mathrm{supp}(\nu)} \big( k_\gamma(x, y) - k_\gamma(x', y') \big) \lesssim \mathrm{diam}(\Omega)^\gamma.$$
5.
Note that this class is not the same as $\mathcal{P}_S$, although related.

Therefore, by McDiarmid's inequality we know that $E_\gamma(\nu, \nu_n)$ is $O(1/n)$-sub-Gaussian (note we do not track constants depending on $\Omega$ here). From Lemma 7 we know that $\mathbb{E}\, E_\gamma(\nu, \nu_n) \lesssim 1/\sqrt{n}$, and the conclusion follows.

Consider the following scenario: we have i.i.d. training data $X_1, \ldots, X_n$ from some distribution $\nu$ and we are training an arbitrary generative model to estimate $\nu$. Suppose that this training process gives us a sequence of density estimators $\{\mu_k\}_{k \geq 1}$, which could be the result of, say, subsequent gradient descent steps on our parametric class of generators. Is there any way to figure out after how many steps $K$ we may stop the training process? In other words, can we identify a value of $K$ such that $\mathrm{TV}(\nu, \mu_K)$ is guaranteed to be less than some threshold with probability $1 - \delta$? Note that an additional difficulty here is that our generative model for $\mu_k$ is able to generate samples from $\mu_k$ but otherwise gives us no other access to $\mu_k$.

The fast (dimension-free) concentration properties of $E_\gamma$ and the minimax optimality guarantees of its minimizer (whenever $\nu$ is smooth) make it an excellent choice for such a stopping criterion. Let $\nu \in \mathcal{P}_S(\beta, d, C)$ and let $\nu_n$ be its empirical version based on the $n$ i.i.d. observations. Assume further that $\{\mu_k\}_{k \geq 1} \subseteq \mathcal{P}_S(\beta, d, C)$ is a sequence of density estimators based on the sample $X_1, \ldots, X_n$. Finally, given the training sample $(X_1, \ldots, X_n)$, for each $k$ let $\mu_{k,m_k} = \frac{1}{m_k}\sum_{i=1}^{m_k} \delta_{X_i^{(k)}}$ be the empirical distribution of the sample $(X_1^{(k)}, \ldots, X_{m_k}^{(k)}) \sim \mu_k^{\otimes m_k}$.

Proposition 23. For any $\beta > 0$, $d \geq 1$ and $\gamma \in (0, 2)$ there exists a constant $c = c(\beta, d, \gamma)$ such that, with probability at least $1 - \delta$, simultaneously for all $k \geq 1$,
$$\mathrm{TV}(\mu_k, \nu) \leq c\Big( \sqrt{\frac{\log(1/\delta)}{n}} + E_\gamma(\mu_{k,m_k}, \nu_n) \Big)^{\frac{2\beta}{2\beta+d+\gamma}},$$
provided we take $m_k = c\, n \log(k^2/\delta)/\log(1/\delta)$.

Proof Let $c = C_1$ where $C_1$ is as in Lemma 22, and fix $\delta \in (0, 1)$. Define the event
$$A = \Big\{ E_\gamma(\nu, \nu_n) \leq \frac{c}{\sqrt{n}} + \sqrt{\frac{c\log(1/\delta)}{n}} \Big\}$$
and similarly
$$A_k = \Big\{ E_\gamma(\mu_{k,m_k}, \mu_k) \leq \frac{c}{\sqrt{m_k}} + \sqrt{\frac{c\, t_k}{m_k}} \Big\}$$
for some sequence t1, t2, …
, and each $k \geq 1$. By Lemma 22, $\mathbb{P}(A_k^c) = \mathbb{E}\,\mathbb{P}(A_k^c \mid X_1, \ldots, X_n) \leq 2\exp(-t_k)$. Taking $t_k = \log(k^2\pi^2/(3\delta))$, the union bound gives that $A$ and every $A_k$ hold simultaneously with probability at least $1 - 3\delta$ (rescaling $\delta$ by a constant factor yields the statement). By the inequality $E_\gamma(\mu_k, \nu) \leq E_\gamma(\mu_k, \mu_{k,m_k}) + E_\gamma(\mu_{k,m_k}, \nu_n) + E_\gamma(\nu_n, \nu)$, it follows that on this event, for every $k \geq 1$,
$$E_\gamma(\nu, \mu_k) \leq E_\gamma(\mu_{k,m_k}, \nu_n) + \frac{c}{\sqrt{n}} + \frac{c}{\sqrt{m_k}} + \sqrt{\frac{c\log(1/\delta)}{n}} + \sqrt{\frac{c\log(k^2\pi^2/(3\delta))}{m_k}}.$$
Thus, by choosing $m_k \asymp n\log(k^2/\delta)/\log(1/\delta)$ we can conclude that there exists a constant $c$ depending only on $\beta, d, \gamma$ such that
$$E_\gamma(\nu, \mu_k) \leq c\sqrt{\frac{\log(1/\delta)}{n}} + E_\gamma(\mu_{k,m_k}, \nu_n), \qquad k \geq 1.$$
The final conclusion follows from Theorem 11.

Note that our bound on the probability holds for all $k$ simultaneously, which is made possible by the fact that $m_k$ grows as $k \to \infty$. The empirical relevance of such a result is immediate: suppose we have proposed candidate generative models $\mu_1, \mu_2, \ldots$ (e.g. one after each period of training epochs, or from different training runs) trained on an i.i.d. dataset $X_1, \ldots, X_n$ of size $n$ from $\nu \in \mathcal{P}_S(\beta, d, C)$. A verifier only needs to request $m_k$ independent draws from the $k$-th candidate, and if we ever achieve $E_{(\log n)^{-1}}(\mu_{k,m_k}, \nu_n) \lesssim \sqrt{\log(1/\delta)/n}$ we can stop training and claim by Theorem 18 that we are a constant factor away from (near-)minimax optimality with probability $1 - \delta$.

5. Suboptimality for Two-Sample Testing

So far in this paper we have shown how the empirical energy distance minimizer, while being mismatched with the target total variation loss, nevertheless achieves nearly minimax optimal performance for density estimation tasks. Unfortunately, this surprising effect does not carry over to other statistical tasks, such as two-sample testing, which we describe in this section.

The task of two-sample testing over a family of distributions $\mathcal{P}$ is the following. Given two samples $(X, Y) \sim p^{\otimes n} \otimes q^{\otimes m}$ with unknown distribution, we need to distinguish between the hypotheses
$$H_0 : p = q,\ p \in \mathcal{P} \qquad \text{versus} \qquad H_1 : \mathrm{TV}(p, q) > \epsilon,\ p, q \in \mathcal{P}$$
with vanishing type-I and type-II error.
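A testing procedure in this spirit, built on the halfspace statistics studied in this paper, can be sketched as follows. This is our own illustration: random directions plus a scan over data-driven thresholds approximate the supremum over halfspaces, and the threshold choice $c \cdot n^{-1/3}$ is just one sequence satisfying $1/\sqrt{n} \ll t_n \ll 1$.

```python
import numpy as np

def halfspace_stat(x, y, n_dirs=200, seed=0):
    """Largest difference of empirical masses of a halfspace {z: <z,w> >= b},
    maximized over random directions w and all thresholds b."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_dirs):
        w = rng.standard_normal(x.shape[1])
        w /= np.linalg.norm(w)
        px, py = np.sort(x @ w), np.sort(y @ w)
        b = np.concatenate([px, py])            # sup over b is attained at data points
        fx = 1.0 - np.searchsorted(px, b, side="left") / len(px)
        fy = 1.0 - np.searchsorted(py, b, side="left") / len(py)
        best = max(best, float(np.abs(fx - fy).max()))
    return best

def two_sample_test(x, y, c=3.0):
    """Reject H0: p = q when the halfspace statistic exceeds t_n = c * n^(-1/3)."""
    n = min(len(x), len(y))
    return halfspace_stat(x, y) >= c * n ** (-1.0 / 3.0)
```

With equal samples the statistic stays near its $O(\sqrt{d/n})$ fluctuation level, while a clear mean shift drives it toward the population halfspace discrepancy.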
The special case of $m = \infty$ is known as goodness-of-fit testing, and for the class of smooth distributions it was famously solved by Ingster (1987), who showed that in dimension $d = 1$ the problem is solvable with probability $1 - o(1)$ if and only if
$$n = \omega\big( \epsilon^{-\frac{2\beta + d/2}{\beta}} \big), \qquad (33)$$
in which case a variant of the $\chi^2$-test works. The case of general $m, n$ and $d \geq 1$ was resolved in Arias-Castro et al. (2018), who showed that the problem is solvable if and only if (33) holds with $n$ replaced by $\min\{n, m\}$, using the very same $\chi^2$-test; see also Li and Yuan (2019). In the remainder of the section we focus on the $m = n$ case for simplicity.

In a recent paper (Paik et al., 2023), the following test statistic for two-sample testing was proposed:
$$T_{d,k}(p, q) = \max_{(w,b) \in S^{d-1} \times [0, \infty)} \Big| \mathbb{E}_{X \sim p}\, (\langle w, X \rangle - b)_+^k - \mathbb{E}_{Y \sim q}\, (\langle w, Y \rangle - b)_+^k \Big|,$$
where the arguments $p, q$ can be either discrete (e.g. empirical measures of observed samples) or continuous densities. Note that here we take $(a)_+^0 = 1\{a \geq 0\}$ by convention. Specifically, the test proposed is to reject the null hypothesis when
$$T_{d,k}(p_n, q_n) \geq t_n, \qquad (34)$$
where $p_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$, $q_n = \frac{1}{n}\sum_{i=1}^n \delta_{Y_i}$ are empirical measures and the threshold $t_n$ satisfies both $t_n = o(1)$ and $t_n = \omega(1/\sqrt{n})$. One of their main technical results (Paik et al., 2023, Theorem 6) asserts that the test (34) returns the correct hypothesis with probability $1 - o(1)$ asymptotically as $n \to \infty$ for any qualifying sequence $\{t_n\}_{n \geq 1}$ and fixed $p, q$. However, this result leaves open questions about the sample complexity of their test, and in particular, whether it is able to achieve known minimax rates. It turns out that our results imply that their test, at least in the $k = 0$ case, cannot attain the optimal two-sample testing sample complexity (33) over the smooth class $\mathcal{P}_S(\beta, d, C)$. To connect to our results, notice that $T_{d,0}(p, q) = d_H(p, q)$.

Proposition 24.
For all $d \geq 1$ and $\beta > 0$, there exist constants $c = c(d, \beta)$, $c' = c'(d, \beta)$ such that for all $\epsilon > 0$, there exist probability density functions $p, q$ supported on the $d$-dimensional unit ball such that:

1. $\|p\|_{\beta,2}, \|q\|_{\beta,2} < c$,
2. $\|p - q\|_1 \asymp \|p - q\|_2 \asymp \epsilon$, and
3. the expected test statistic satisfies $\mathbb{E}[T_{d,0}(p_n, q_n)] \leq c'/\sqrt{n}$ for any $n \leq (\log\frac{1}{\epsilon})^{-d}\, \epsilon^{-\frac{2\beta+d+1}{\beta}}$.

In other words, consistent testing using the statistic $T_{d,0}$ is impossible with $n = o(\epsilon^{-\frac{2\beta+d+1}{\beta}})$ samples, which is a far cry from the optimal sample complexity (33) attainable by the $\chi^2$ test. The proof of Proposition 24 is given in Section F.

6. Conclusion

We analyzed the simple discriminating class of affine classifiers and proved its effectiveness in the ERM-GAN setting (2) within the Sobolev class $\mathcal{P}_S(\beta, d)$ and Gaussian mixtures $\mathcal{P}_G(d)$ with respect to the $L^2$ norm (see Theorem 18 and Corollary 19) and the total variation distance (see Theorem 1). Our findings affirm the near-optimality of these rates for the considered classes $\mathcal{P}_S$ and $\mathcal{P}_G$. Moreover, we present inequalities that interlink the $E_\gamma$, TV, and $L^2$ distances, and demonstrate (in some cases) the tightness of these relationships via corresponding lower bound constructions (Section E). We also interpret the generalized energy distance in several ways that help advocate for its use in real applications.

This work connects to a broader literature on the theoretical analysis of GAN-style models. An interesting question emerges about the interaction between the expressiveness and the concentration of the discriminator class. We found that the class of affine classifiers $\mathcal{D}_1$ is guaranteed to maintain some (potentially small) proportion of the total variation distance, and that it decays at the parametric rate between population and empirical distributions. Thus, we have traded off expressiveness for better concentration of the resulting IPM.
As discussed in Section 1.2, the Yatracos estimator lies at the other end of this discriminator expressiveness-concentration trade-off: the distance $d_{\mathcal Y}$ is as expressive as total variation when restricted to the generator class $\mathcal G$, but $\sup_{\nu\in\mathcal G}\mathbb{E}\,d_{\mathcal Y}(\nu, \nu_n)$ decays strictly slower than $1/\sqrt n$ for nonparametric classes $\mathcal G$. A downside compared to $d_H$ is that (i) the Yatracos class $\mathcal Y$ requires knowledge of $\mathcal G$, while our $\mathcal D_1$ is oblivious to $\mathcal G$, and (ii) the distance $d_{\mathcal Y}$ is impractical to compute, as it requires a covering of $\mathcal G$. Our question is: is it possible to find a class of sets $\mathcal S \subseteq 2^{B(0,1)}$ that lies at an intermediate point on this trade-off? In other words, does an $\mathcal S$ exist such that the ERM $\hat\nu$ defined in (2) using the discriminator class $\mathcal D = \mathcal S$ is optimal over, say, $\mathcal G = \mathcal P_S$, while the induced distance between empirical and population measures converges at a rate slower than $1/\sqrt n$ but faster than $n^{-\beta/(2\beta+d)}$? What would be the desiderata for a sample-efficient discriminator that neither is fully expressive against total variation nor concentrates at a parametric rate?

Acknowledgments

Supported in part by the NSF grant CCF-2131115 and the MIT-IBM Watson AI Lab.

Appendix A. Related works showing neural discriminator implies distance is small

In Section 1.2 we already reviewed, at a high level, existing results on GANs and density estimation. Here we provide full details and emphasize the differences with our work.

A.1 Results on smooth pushforward densities

A range of works has focused on showing results not for smooth densities but for smooth pushforwards $g_\# U$ (for example, $U \sim \mathrm{Unif}([0,1]^d)$ and $g$ a $\beta$-Hölder smooth map). While the exact setting differs, it can be shown that, at least for TV estimation on the open unit cube (or unit ball), guarantees for estimating smooth pushforwards imply guarantees for smooth density estimation. Indeed, one can first assume without loss of generality that the target density is bounded below, via the linear transform $f \mapsto \frac12(\mathrm{unif} + f)$ (see e.g.
Lemma 32.18 in Polyanskiy and Wu (2023+)), and bounded above by smoothness. Furthermore, for any two (Hölder) $\beta$-smooth densities bounded above and below on (open) uniformly convex domains, the unique optimal transport map with respect to the quadratic cost is $(\beta+1)$-smooth (see Lemma 10 in Chae et al. (2023) as well as Theorem 4.14 in Villani (2003)). Therefore, these rates can be compared to our results.

1. (Concurrent with ours) Stéphanovitch et al. (2023), Models 1 and 3. This model assumes that the true density is $g_\# U$ for $U$ uniform on the $d$-dimensional torus and $g$ a $(\beta+1)$-Hölder smooth map from the torus to $\mathbb{R}^p$. The authors obtain optimal rates with respect to a discriminator class consisting of $\gamma$-smooth functions for $\gamma \ge 1$ (the case $\gamma = 1$ matches the Wasserstein distance $W_1$). Under the assumption that the target density is $\beta$-smooth, results comparing $W_1$ and TV (Chae (2024) and Corollary 4.4 in Stéphanovitch et al. (2023)) apply. Substituting the rate $\tilde O\big(n^{-(\beta+1)/(2\beta+d)}\big)$ for the $W_1$ distance ($\gamma = 1$) from Theorem 5.8 (Model 3 therein) into Chae and Walker (2020), one gets the optimal rate $\tilde O\big(n^{-\beta/(2\beta+d)}\big)$. However, their discriminator crucially relies on knowledge of the parameters $n, \beta$ (see (4.5) therein).

2. Chae (2022). The authors obtained results for the $W_1$ distance of smooth pushforwards where the observed samples are corrupted by Gaussian noise with variance $\sigma^2$. In the zero-noise case, their convergence results in $W_1$ for $\beta$-Hölder smooth pushforward functions give the slightly suboptimal rate $O\big(n^{-\beta/(2\beta+d)}\big)$ (Theorem 3.2 therein), compared to the (conjectured) optimal rate $O\big(n^{-\beta/(2\beta+d-2)}\big)$ (Theorem 4.1 therein). As above, their results on $W_1$ can be applied to total variation in this setting by known distance comparison inequalities. Furthermore, their discriminator is a large nonparametric class (not a neural network).

3. Belomestny et al.
(2021). The authors studied the GAN estimator with $U \sim \mathrm{Unif}([0,1]^d)$ and a pushforward map $g$ that is $\Lambda$-regular and $(\beta+1)$-Hölder smooth (despite having the same minimax rate, this class does not cover all bounded $\beta$-smooth densities; see the discussion after Theorem 3 therein). Their results (Theorem 2) obtain JS-divergence rates that are optimal up to logarithmic factors, which are equivalent to TV rates since $\mathrm{TV}^2 \lesssim \mathrm{JS}$. In contrast to our work, their discriminator class is essentially a nonparametric class (of functions $\frac{p(x)}{q(x)+p(x)}$, where $p, q$ range over pairs of smooth densities) that is approximated by giant neural networks, for which they assume the existence of an oracle approximating smooth discriminator functions over those networks (Belomestny et al. (2023)). Furthermore, this discriminator network needs to be larger than some function of $\beta$ and $n$, whereas ours is fixed.

A.2 Results on smooth densities

There is also a large literature on GAN estimation of smooth densities (with slightly varying definitions) that is useful to review for contrast with our own work.

1. Singh et al. (2018) and Uppal et al. (2019). While these works focus on projection density estimators, Theorem 7 in Singh et al. (2018) and Theorem 9 in Uppal et al. (2019) establish smooth density estimation rates for GANs. The authors show that if the discriminator class (functions of Sobolev smoothness and Besov smoothness $s$, respectively) is replaced with a neural network, then the GAN estimator attains the optimal rate provided the neural networks approximate the smoothness class well, for which known results such as Yarotsky (2017) apply. However, bounds on TV or $W_1$ are not directly covered by their theorems.

2. Chen et al. (2020). The authors considered Hölder-smooth densities on convex support that are bounded below.
They obtained suboptimal rates $O\big(n^{-\gamma/(2\gamma+d)} \vee n^{-(\beta+1)/(2(\beta+1)+d)}\big)$ on the IPM indexed by Hölder smooth functions $\mathcal H^\gamma$, $\gamma \ge 1$ (where $\gamma = 1$ corresponds to Lipschitz functions, with $W_1$ being the respective IPM), versus the minimax rate $O\big(n^{-(\beta+\gamma)/(2\beta+d)}\big)$. They also had to rely on (oracle) neural-network universal-approximating discriminators that depend on $\beta$.

To conclude this section, we mention that there are also several recent works on non-GAN estimators that we omitted, such as (Stéphanovitch et al., 2023, Model 2), (Liang, 2021, Theorem 4), Divol (2022) and Tang and Yang (2023). We recommend the survey in Section 2.3 and Table 1 of Stéphanovitch et al. (2023).

Appendix B. Auxiliary Technical Results

In this section we list some technical lemmas that are used in our later proofs.

Theorem 25 (Plancherel theorem). Let $f, g \in L^2(\mathbb{R}^d)$. Then
$$\int_{\mathbb{R}^d} f(x)\overline{g(x)}\,dx = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d}\hat f(\omega)\overline{\hat g(\omega)}\,d\omega.$$

Lemma 26. Suppose $t, x, y > 0$. Then there exist finite $t$-dependent constants $C_1, C_2$ such that
$$x \le y\log(3+1/y)^t \;\Longrightarrow\; \frac{x}{\log(3+1/x)^t} \le C_1\,y \;\Longrightarrow\; x \le C_2\,y\log(3+1/y)^t.$$

Proof. Let us focus on the first implication. If $x \le y$, then it clearly holds. If $y \le x \le y\log(3+1/y)^t$, then, since $u \mapsto u/\log(3+1/u)^t$ is increasing, it suffices to show
$$\frac{x}{\log(3+1/x)^t} \le \frac{y\log(3+1/y)^t}{\log\big(3 + 1/(y\log(3+1/y)^t)\big)^t} \le C_1\,y.$$
The second inequality is equivalent to
$$\left(\frac{\log(3+1/y)}{\log\big(3 + 1/(y\log(3+1/y)^t)\big)}\right)^t \le C_1.$$
Now, if $y \ge 1/2$ then clearly taking $C_1 = \log_3(5)^t$ works. Suppose instead that $y \in (0, 1/2)$. Then, since $\log$ grows slower than any polynomial, there exists a $t$-dependent constant $c_t < \infty$ such that $\log(3+1/y) \le c_t\,y^{-1/(2t)}$ for all $y \in (0, 1/2)$. Therefore, we have
$$3 + \frac{1}{y\log(3+1/y)^t} \ge 3 + \frac{1}{c_t^t\,y^{1/2}}.$$
It is then clear that $\log(3+1/y) \le C_1^{1/t}\log\big(3 + \frac{1}{c_t^t\,y^{1/2}}\big)$ holds for all $y \in (0, 1/2)$ if we take $C_1$ large enough in terms of $t$. The second implication follows analogously and we omit its proof.

Lemma 27. Let $\mu$ be a probability distribution on $\mathbb{R}^d$, let $X \sim \mu$, and let $\gamma \in (0, 2)$. Then
$$\int_{\mathbb{R}^d}\frac{\mathbb{E}\big[(\cos\langle\omega, X\rangle - 1)^2 + \sin^2\langle\omega, X\rangle\big]}{\|\omega\|^{d+\gamma}}\,d\omega \le \frac{16\pi^{d/2}\,M_\gamma(\mu)}{\Gamma(d/2)\,\gamma(2-\gamma)}.$$

Proof. We use the inequality $(\cos t - 1)^2 + \sin^2(t) \le 4\min(t^2, 1)$, valid for all $t \in \mathbb{R}$.
Plugging in and using the Cauchy-Schwarz inequality $\langle\omega, X\rangle^2 \le \|\omega\|^2\|X\|^2$, the quantity on the left-hand side can be bounded as
$$4\int_{\mathbb{R}^d}\frac{\mathbb{E}\,\min(1, \|\omega\|^2\|X\|^2)}{\|\omega\|^{d+\gamma}}\,d\omega = 4\,\mathrm{vol}_{d-1}(S^{d-1})\,\mathbb{E}\left[\int_0^{1/\|X\|}\|X\|^2\,r^{1-\gamma}\,dr + \int_{1/\|X\|}^{\infty}\frac{dr}{r^{1+\gamma}}\right] = \frac{16\pi^{d/2}\,M_\gamma(\mu)}{\Gamma(d/2)\,\gamma(2-\gamma)},$$
where $\mathrm{vol}_{d-1}(S^{d-1}) = \frac{2\pi^{d/2}}{\Gamma(d/2)}$ is the surface area of the unit $(d-1)$-sphere.

Lemma 28. For $\gamma \in (0, 2)$ define
$$B_\gamma := \sup_{0 < a < b < \infty}\begin{cases}\big|\int_a^b \frac{\sin\omega}{\omega}\,d\omega\big| & \text{if } \gamma = 1,\\[2pt] \big|\int_a^b \frac{\cos\omega}{\omega^{(1+\gamma)/2}}\,d\omega\big| & \text{if } \gamma \in (0,1),\\[2pt] \big|\int_a^b \frac{\cos\omega - 1}{\omega^{(1+\gamma)/2}}\,d\omega\big| & \text{if } \gamma \in (1,2).\end{cases}$$
Then $B_\gamma < \infty$.

Proof. In the $\gamma > 1$ case, one immediately has $B_\gamma \le \int_0^\infty \frac{\min\{1, \omega^2/2\}}{\omega^{(1+\gamma)/2}}\,d\omega < \infty$. Now let us consider $\gamma \le 1$. One has $B_\gamma \le \sup_{0 < c < \infty} 2|I_\gamma(c)|$, where
$$I_\gamma(a) = \begin{cases}\int_1^a \frac{\sin\omega}{\omega}\,d\omega & \text{if } \gamma = 1,\\[2pt] \int_1^a \frac{\cos\omega}{\omega^{(1+\gamma)/2}}\,d\omega & \text{if } \gamma\in(0,1).\end{cases}$$
Clearly $a \mapsto I_\gamma(a)$ is continuous on $(0, \infty)$ for each $\gamma \in (0, 2)$. Moreover, $|I_\gamma(a)| \le |a - 1|\max\{a^{-(\gamma+1)/2}, 1\}$. Therefore, we only need to show that $\lim_{a\to\infty}|I_\gamma(a)|$ and $\lim_{a\to0}|I_\gamma(a)|$ are both finite. Let $g_\gamma(x) = x^{-(\gamma+1)/2}$ and $f_\gamma(x) = \sin(x)$ if $\gamma = 1$ and $\cos(x)$ otherwise, so that we may write $I_\gamma(a) = \int_1^a f_\gamma(\omega)g_\gamma(\omega)\,d\omega$.
1. For $a \to \infty$: since $g_\gamma(\omega) \to 0$ monotonically and $\big|\int_1^a f_\gamma(x)\,dx\big| \le 2$ is uniformly bounded, $\int_1^\infty f_\gamma(\omega)g_\gamma(\omega)\,d\omega$ converges to a finite value by Dirichlet's test for improper integrals (Malik and Arora, 1992, page 391).6
2. For $a \to 0$, the conclusion follows from the inequality $|\sin(\omega)| \le \min\{|\omega|, 1\}$ in the case $\gamma = 1$, and from $|I_\gamma(a)| \le \int_a^1 \omega^{-(1+\gamma)/2}\,d\omega \le \int_0^1 \omega^{-(1+\gamma)/2}\,d\omega = \frac{2}{1-\gamma}$ for $\gamma < 1$.
Therefore $\sup_{a>0}|I_\gamma(a)| < \infty$, which concludes the proof.

6. Reference pointed out by user Siminore on math.stackexchange.com.

Lemma 29. Let $\int_0^{\rightarrow\infty}(\cdot)\,d\omega := \lim_{\epsilon\to0}\int_\epsilon^{1/\epsilon}(\cdot)\,d\omega$ and recall the definition of $\psi_\gamma$ from (17). Then, for $x \neq 0$ the following holds:
$$\psi_\gamma(x) = C_{\psi_\gamma}\begin{cases}\int_0^{\rightarrow\infty}\frac{\sin(\omega x)}{\omega}\,d\omega & \text{for } \gamma = 1,\\[2pt] \int_0^{\rightarrow\infty}\frac{\cos(\omega x)}{\omega^{(1+\gamma)/2}}\,d\omega & \text{for } \gamma\in(0,1),\\[2pt] \int_0^{\rightarrow\infty}\frac{\cos(\omega x) - 1}{\omega^{(1+\gamma)/2}}\,d\omega & \text{for } \gamma\in(1,2),\end{cases}\qquad C_{\psi_\gamma} = \begin{cases}\Big(\cos\big(\frac{\pi(\gamma-1)}{4}\big)\,\Gamma\big(\frac{1-\gamma}{2}\big)\Big)^{-1} & \text{if } \gamma \neq 1,\\[2pt] \frac{1}{\pi} & \text{if } \gamma = 1.\end{cases}$$

Proof. For $x \neq 0$ clearly
$$\int_0^{\rightarrow\infty}\frac{\sin(\omega x)}{\omega}\,d\omega = \mathrm{sign}(x)\int_0^{\rightarrow\infty}\frac{\sin\omega}{\omega}\,d\omega = \mathrm{sign}(x)\,\frac{\pi}{2},$$
which shows the first claim. Assume from here on, without loss of generality, that $x > 0$. For $\gamma \in (0, 1)$, by the residue theorem (rotating the contour to the positive imaginary axis),
$$\int_0^{\rightarrow\infty}\frac{\cos(\omega x)}{\omega^{(1+\gamma)/2}}\,d\omega = x^{(\gamma-1)/2}\int_0^{\rightarrow\infty}\frac{\cos\omega}{\omega^{(1+\gamma)/2}}\,d\omega = x^{(\gamma-1)/2}\,\Re\int_0^{\rightarrow\infty}\frac{e^{i\omega}}{\omega^{(1+\gamma)/2}}\,d\omega = x^{(\gamma-1)/2}\cos\Big(\frac{\pi(\gamma-1)}{4}\Big)\,\Gamma\Big(\frac{1-\gamma}{2}\Big).$$
Similarly, for $\gamma \in (1, 2)$, integration by parts and the residue theorem give
$$\int_0^{\rightarrow\infty}\frac{\cos(\omega x) - 1}{\omega^{(1+\gamma)/2}}\,d\omega = x^{(\gamma-1)/2}\int_0^{\rightarrow\infty}(\cos\omega - 1)\,d\Big(\frac{-2}{(\gamma-1)\,\omega^{(\gamma-1)/2}}\Big) = -x^{(\gamma-1)/2}\,\frac{2}{\gamma-1}\int_0^{\rightarrow\infty}\frac{\sin\omega}{\omega^{(\gamma-1)/2}}\,d\omega$$
$$= -x^{(\gamma-1)/2}\,\frac{2}{\gamma-1}\,\cos\Big(\frac{\pi(\gamma-1)}{4}\Big)\,\Gamma\Big(\frac{3-\gamma}{2}\Big) = x^{(\gamma-1)/2}\cos\Big(\frac{\pi(\gamma-1)}{4}\Big)\,\Gamma\Big(\frac{1-\gamma}{2}\Big).$$

Lemma 30. Let $\varphi$ be the probability density function of $\mathcal N(0, \sigma^2 I_d)$ and write $\hat\varphi$ for its Fourier transform. Then, for any $\beta \ge 1$,
$$\big\|\hat\varphi(\omega)\,\|\omega\|^{\beta}\big\|_2^2 = \frac{\pi^{d/2}}{\Gamma(d/2)\,\sigma^{2\beta+d}}\,\Gamma\Big(\frac{2\beta+d}{2}\Big) \le \frac{5\,\pi^{d/2}}{\Gamma(d/2)\,\sigma^{2\beta+d}}\Big(\frac{2\beta+d}{2e}\Big)^{\frac{2\beta+d-1}{2}}.$$

Proof. It is well known that $\hat\varphi(\omega) = e^{-\frac{\sigma^2}{2}\|\omega\|^2}$. The claimed equality then follows from the formula for the $2\beta$-th moment of the Gaussian distribution with mean $0$ and variance $1/(2\sigma^2)$. The inequality follows by Lemma 31.

Lemma 31 (Properties of the gamma function). For all $x > 1$ the inequality $\Gamma(x) \le 5\,(x/e)^{x-1/2}$ holds.

Proof. In Minc and Sathre (1964) the authors showed that $\log\Gamma(x) \le (x - \frac12)\log(x) - x + \frac12\log(2\pi) + 1$ for all $x \ge 1$, from which the claim follows, as $\exp\big(\frac12\log(2\pi) + \frac12\big) < 5$.

Lemma 32. Let $b \ge 1$ and $|a| < b$. Then
$$\int_0^r x^a|\sin(x)|^b\,dx = O(r^{a+b}) \quad\text{as } r\to\infty,$$
where we hide constants depending on $a, b$.

Proof. Since we are only interested in the asymptotic behaviour as $r \to \infty$, assume without loss of generality that $r \ge 1$. Then, we have
$$\int_0^r x^a|\sin(x)|^b\,dx = \underbrace{\int_0^1 x^a|\sin(x)|^b\,dx}_{I} + \underbrace{\int_1^r x^a|\sin(x)|^b\,dx}_{II}.$$
Using the inequality $|\sin(x)| \le x$, we can bound the first term as
$$I \le \int_0^1 x^{a+b}\,dx = \frac{1}{a+b+1} \le 1 = O(r^{a+b}),$$
since $a + b > 0$. For the second term, we obtain
$$II \le \int_1^r x^a\,dx \lesssim \begin{cases} r^{a+1} & \text{if } a \neq -1,\\ \log(r) & \text{if } a = -1,\end{cases}$$
which is $O(r^{a+b})$, where the last step uses $a + b > 0$ and $b \ge 1$.

Lemma 33. Let $a, b, c \in \mathbb{R}$ with $b > 0$ be constants. For all large enough $r$ one has
$$\int_r^\infty x^a\exp\Big(-\frac{bx}{\log^2(x+2)}\Big)\,dx \le r^{-c}.$$

Proof. Assume, without loss of generality, that $c \ge 0$. For all large enough $x$ one has
$$x^a\exp\Big(-\frac{bx}{\log^2(x+2)}\Big) \le x^{-c-2}.$$
Therefore, for large enough $r$,
$$\int_r^\infty x^a\exp\Big(-\frac{bx}{\log^2(x+2)}\Big)\,dx \le \int_r^\infty x^{-c-2}\,dx \lesssim r^{-c-1} < r^{-c}.$$

Lemma 34. Let $J_\nu$ be the Bessel function of the first kind of order $\nu$.
1. For all $x \in \mathbb{R}^d$,
$$\int_{S^{d-1}} e^{i\langle x, v\rangle}\,d\sigma(v) = (2\pi)^{d/2}\,\|x\|^{1-d/2}\,J_{d/2-1}(\|x\|).$$
2. For any $\nu \in \mathbb{R}$, as $x \to \infty$,
$$J_\nu(x) = \sqrt{\frac{2}{\pi x}}\,\cos\Big(x - \frac{\nu\pi}{2} - \frac{\pi}{4}\Big) + O(x^{-3/2}). \tag{35}$$
3. For all $x \in \mathbb{R}^d$,
$$\int_{B_d(0,1)} e^{i\langle x, w\rangle}\,dw = \frac{(2\pi)^{d/2}}{\|x\|^{d/2}}\,J_{d/2}(\|x\|).$$

Proof. For Item 1 set $r = \|x\|$ and $s = (2\pi)^{-1}$ in the calculation on page 154 of Stein and Weiss (1971). For Item 2 see (Watson, 1980, Eq. (1) in Section 7.21). For Item 3, assuming without loss of generality that $x = \|x\|e_1$, we can compute
$$\int_{\|w\|\le1} e^{i\langle x, w\rangle}\,dw = \int_{-1}^{1} e^{i\|x\|w_1}\int_{w_2^2+\dots+w_d^2 \le 1 - w_1^2} dw_2\dots dw_d\,dw_1 = v_{d-1}\int_{-1}^{1} e^{i\|x\|w_1}\,(1 - w_1^2)^{(d-1)/2}\,dw_1, \tag{36}$$
where $v_{d-1} = \pi^{(d-1)/2}/\Gamma\big(\frac{d+1}{2}\big)$ is the volume of the unit ball in $\mathbb{R}^{d-1}$. Recall from (Watson, 1980, Section 3.1) the definition of the Bessel function of the first kind as
$$J_\nu(x) = \frac{(x/2)^\nu}{\Gamma(\nu+\frac12)\,\Gamma(\frac12)}\int_0^\pi \cos(x\cos(\theta))\,\sin^{2\nu}(\theta)\,d\theta,$$
valid for $\nu > -1/2$; the above is also known as the Poisson representation. Changing variables to $u = \cos(\theta)$, we see that it is equal to
$$J_\nu(x) = \frac{(x/2)^\nu}{\Gamma(\nu+\frac12)\,\Gamma(\frac12)}\int_{-1}^{1} e^{ixu}(1 - u^2)^{\nu - 1/2}\,du. \tag{37}$$
Comparing (37) (with $\nu = d/2$) with (36) concludes the proof.

Lemma 35. There exists a radial function $h_0 \in L^2(\mathbb{R}^d)$ such that $\mathrm{supp}(h_0) \subseteq B(0, 1)$,
$$|\hat h_0(w)| \le C\exp\Big(-\frac{c\,\|w\|}{\log(\|w\|+2)^2}\Big)\ \text{ for all } w\in\mathbb{R}^d, \qquad |\hat h_0(w)| \ge \frac12\ \text{ for all } \|w\| \le r_{\min},$$
where $C, c, r_{\min} > 0$.

Proof. Apply Theorem 1.4 in Cohen (2023) using the spherically symmetric weight function $u : \mathbb{R}^d \to \mathbb{R}_{\ge0}$ defined by $u(w) = u(\|w\|) = \|w\|\log(\|w\|+2)^{-2}$, where $(a)_+ := \max(a, 0)$ for $a \in \mathbb{R}$.

Appendix C. Proof of Theorem 8

For $v \in S^{d-1}$ and $b \in \mathbb{R}$, let $\theta_v(x) = \langle v, x\rangle$ and write $\eta_v := \theta_{v\#}(\mu - \nu)$ for the pushforward of the signed measure $\mu - \nu$ through the map $\theta_v$. To start with, we notice that
$$\int_{S^{d-1}}\int_{\mathbb{R}}\Big[\mathbb{E}\,\psi_\gamma(\langle X, v\rangle - b) - \mathbb{E}\,\psi_\gamma(\langle Y, v\rangle - b)\Big]^2\,db\,d\sigma(v) = \int_{S^{d-1}}\int_{\mathbb{R}}\Big(\int_{\mathbb{R}}\psi_\gamma(x - b)\,d\eta_v(x)\Big)^2\,db\,d\sigma(v). \tag{38}$$
For each $v \in S^{d-1}$, the measure $\eta_v$ has at most countably many atoms, therefore $b \mapsto \eta_v(\{b\})$ vanishes Leb-almost everywhere. Then, by Tonelli's theorem, we can conclude that $\eta_v(\{b\}) = 0$ for $\sigma\otimes$Leb-almost every $(v, b)$; thus going forward we can focus on the case $x \neq b$. By Lemma 29, and writing $A_\epsilon = [\epsilon, 1/\epsilon]$ for $\epsilon > 0$, we have
$$\int_{\mathbb{R}}\psi_\gamma(x - b)\,d\eta_v(x) = C_{\psi_\gamma}\int_{\mathbb{R}}\lim_{\epsilon\to0}\int_{A_\epsilon}\begin{cases}\frac{\sin(\omega(x-b))}{\omega} & \text{if } \gamma = 1\\[2pt] \frac{\cos(\omega(x-b))}{\omega^{(1+\gamma)/2}} & \text{if } \gamma\in(0,1)\\[2pt] \frac{\cos(\omega(x-b)) - 1}{\omega^{(1+\gamma)/2}} & \text{if } \gamma\in(1,2)\end{cases}\,d\omega\,d\eta_v(x).$$
Note that in the $\gamma = 1$ case we implicitly used that $\int_{\mathbb{R}} d\eta_v(x) = 0$.
To exchange the integral over $x$ and the limit over $\epsilon$, notice that for any $\epsilon > 0$ and $x \neq b \in \mathbb{R}$,
$$\Big|\int_{A_\epsilon}\frac{\sin(\omega(x-b))}{\omega}\,d\omega\Big| \le B_\gamma \ \text{ if } \gamma = 1, \qquad \Big|\int_{A_\epsilon}\frac{\cos(\omega(x-b))}{\omega^{(1+\gamma)/2}}\,d\omega\Big| \le B_\gamma\,|x-b|^{(\gamma-1)/2}\ \text{ if } \gamma\in(0,1),$$
$$\Big|\int_{A_\epsilon}\frac{\cos(\omega(x-b)) - 1}{\omega^{(1+\gamma)/2}}\,d\omega\Big| \le B_\gamma\,|x-b|^{(\gamma-1)/2}\ \text{ if } \gamma\in(1,2),$$
where $B_\gamma < \infty$ depends only on $\gamma$ and is defined in Lemma 28. We now show that $\int_{\mathbb{R}}|x - b|^{(\gamma-1)/2}\,d|\eta_v|(x) < \infty$ for $\sigma\otimes$Leb-almost every $(b, v)$. To this end, let $S = \big\{(b, v)\in\mathbb{R}\times S^{d-1} : \int_{\mathbb{R}}|x-b|^{(\gamma-1)/2}\,d|\eta_v|(x) = \infty\big\}$ and assume for contradiction that $(\sigma\otimes\mathrm{Leb})(S) > 0$. Then $1_{([-B,B]\times S^{d-1})\cap S} \uparrow 1_S$ as $B\to\infty$, and thus by the monotone convergence theorem there exists a finite $B$ such that $(\mathrm{Leb}\otimes\sigma)\big(([-B,B]\times S^{d-1})\cap S\big) > 0$. However, by Tonelli's theorem and the Cauchy-Schwarz inequality we have
$$\int_{-B}^{B}\Big(\int_{\mathbb{R}}|x-b|^{(\gamma-1)/2}\,d|\eta_v|(x)\Big)^2\,db \lesssim \int_{-B}^{B}\int_{\mathbb{R}}|x-b|^{\gamma-1}\,d|\eta_v|(x)\,db \lesssim \int_{\mathbb{R}}\int_0^{B+|x|}b^{\gamma-1}\,db\,d|\eta_v|(x) \lesssim \int_{\mathbb{R}}(B + |x|)^{\gamma}\,d|\eta_v|(x) \lesssim B^\gamma + \mathbb{E}_{X\sim\mu}\big[|\langle v, X\rangle|^\gamma\big] + \mathbb{E}_{Y\sim\nu}\big[|\langle v, Y\rangle|^\gamma\big] \le B^\gamma + M_\gamma(\mu + \nu),$$
which, after integration over $v \in S^{d-1}$, leads to a contradiction if $M_\gamma(\mu+\nu) < \infty$. Continuing under the assumption $M_\gamma(\mu+\nu) < \infty$, we can apply the dominated convergence theorem to obtain
$$\int_{\mathbb{R}}\psi_\gamma(x-b)\,d\eta_v(x) = C_{\psi_\gamma}\lim_{\epsilon\to0}\int_{\mathbb{R}}\int_{A_\epsilon}\begin{cases}\frac{\sin(\omega(x-b))}{\omega} & \text{if } \gamma=1\\[2pt] \frac{\cos(\omega(x-b))}{\omega^{(1+\gamma)/2}} & \text{if } \gamma\in(0,1)\\[2pt] \frac{\cos(\omega(x-b))-1}{\omega^{(1+\gamma)/2}} & \text{if } \gamma\in(1,2)\end{cases}\,d\omega\,d\eta_v(x).$$
Then, by Fubini's theorem, we exchange the order of integration to get
$$\int_{\mathbb{R}}\psi_\gamma(x-b)\,d\eta_v(x) = C_{\psi_\gamma}\lim_{\epsilon\to0}\int_{A_\epsilon}\int_{\mathbb{R}}\begin{cases}\frac{\sin(\omega(x-b))}{\omega} & \text{if } \gamma=1\\[2pt] \frac{\cos(\omega(x-b))}{\omega^{(1+\gamma)/2}} & \text{if } \gamma\in(0,1)\\[2pt] \frac{\cos(\omega(x-b))-1}{\omega^{(1+\gamma)/2}} & \text{if } \gamma\in(1,2)\end{cases}\,d\eta_v(x)\,d\omega.$$
Noticing that $\int_{\mathbb{R}} e^{-i\omega x}\,d\eta_v(x) = \hat\eta_v(\omega)$, $\overline{\hat\eta_v(\omega)} = \hat\eta_v(-\omega)$ and $\hat\eta_v(0) = 0$, we obtain
$$\int_{\mathbb{R}}\psi_\gamma(x-b)\,d\eta_v(x) = C_{\psi_\gamma}\lim_{\epsilon\to0}\int_{A_\epsilon}\begin{cases}\Im\big(e^{-i\omega b}\,\hat\eta_v(\omega)\big)\,\omega^{-1} & \text{if } \gamma=1\\[2pt] \Re\big(e^{-i\omega b}\,\hat\eta_v(\omega)\big)\,\omega^{-(1+\gamma)/2} & \text{if } \gamma\neq1\end{cases}\,d\omega = C_{\psi_\gamma}\lim_{\epsilon\to0}\begin{cases}\Im\,\hat\Psi_{\gamma,v,\epsilon}(b) & \text{if } \gamma=1\\[2pt] \Re\,\hat\Psi_{\gamma,v,\epsilon}(b) & \text{if } \gamma\neq1,\end{cases}$$
where we write $\Psi_{\gamma,v,\epsilon}(\omega) = \frac{\hat\eta_v(\omega)}{\omega^{(1+\gamma)/2}}\,1\{\omega\in A_\epsilon\}$ and $\hat\Psi_{\gamma,v,\epsilon}(b) = \int_{\mathbb{R}}\Psi_{\gamma,v,\epsilon}(\omega)\,e^{-ib\omega}\,d\omega$. Notice that $\Psi_{\gamma,v,\epsilon}$ is bounded and compactly supported, and thus lies in $L^p(\mathbb{R})$ for any $p$; in particular $\Psi_{\gamma,v,\epsilon} \in L^1(\mathbb{R})\cap L^2(\mathbb{R})$, which ensures that $\hat\Psi_{\gamma,v,\epsilon} \in L^\infty(\mathbb{R})\cap L^2(\mathbb{R})$. Finally, let us write
$$\Psi_{\gamma,v}(\omega) = \lim_{\epsilon\to0}\Psi_{\gamma,v,\epsilon}(\omega) = \frac{\hat\eta_v(\omega)}{\omega^{(1+\gamma)/2}}\,1\{\omega > 0\}$$
for every $\omega$. We now show that $\Psi_{\gamma,v} \in L^2(\mathbb{R})$ provided $M_\gamma(\mu+\nu) < \infty$, which is assumed throughout. Let $(X, Y) \sim \mu\otimes\nu$.
We have
$$\int_{\mathbb{R}}|\Psi_{\gamma,v}(\omega)|^2\,d\omega = \int_0^\infty\frac{\big(\mathbb{E}[\cos\langle\omega v, X\rangle - \cos\langle\omega v, Y\rangle]\big)^2 + \big(\mathbb{E}[\sin\langle\omega v, X\rangle - \sin\langle\omega v, Y\rangle]\big)^2}{\omega^{1+\gamma}}\,d\omega.$$
Using the inequality $(a - b)^2 \le 2(a-1)^2 + 2(b-1)^2$ for the cosine term, the inequality $(a - b)^2 \le 2a^2 + 2b^2$ for the sine term, and applying Jensen's inequality to take the expectation outside, we can conclude that $\Psi_{\gamma,v} \in L^2(\mathbb{R})$ by Lemma 27. Thus, by the dominated convergence theorem, $\|\Psi_{\gamma,v,\epsilon} - \Psi_{\gamma,v}\|_2 \to 0$ as $\epsilon\to0$. Then, by Parseval's identity,
$$\|\hat\Psi_{\gamma,v,\epsilon} - \hat\Psi_{\gamma,v}\|_2 \to 0 \tag{39}$$
as $\epsilon\to0$. It is well known that convergence in $L^2(\mathbb{R})$ implies that there exists a subsequence $\{\epsilon_n\}_{n=1}^\infty$ with $\epsilon_n\to0$ and $\hat\Psi_{\gamma,v,\epsilon_n} \to \hat\Psi_{\gamma,v}$ almost everywhere.7 Therefore, by passing to this subsequence, it follows that
$$\int_{\mathbb{R}}\psi_\gamma(x-b)\,d\eta_v(x) = C_{\psi_\gamma}\begin{cases}\Im\,\hat\Psi_{\gamma,v}(b) & \text{if } \gamma = 1,\\ \Re\,\hat\Psi_{\gamma,v}(b) & \text{if } \gamma\neq1,\end{cases}$$
for $\sigma\otimes$Leb-almost every $(b, v) \in \mathbb{R}\times S^{d-1}$. Note that since $\eta_v$ is a real measure,
$$\Re\,\hat\Psi_{\gamma,v}(b) = \frac{\hat\Psi_{\gamma,v}(b) + \overline{\hat\Psi_{\gamma,v}(b)}}{2} = \frac12\int_0^\infty\Big(\frac{\hat\eta_v(\omega)}{\omega^{(1+\gamma)/2}}\,e^{-ib\omega} + \frac{\hat\eta_v(-\omega)}{\omega^{(1+\gamma)/2}}\,e^{ib\omega}\Big)\,d\omega = \int_{\mathbb{R}}\frac{\hat\eta_v(\omega)}{2\,|\omega|^{(1+\gamma)/2}}\,e^{-ib\omega}\,d\omega = \mathcal F\Big[\frac{\hat\eta_v(\omega)}{2\,|\omega|^{(1+\gamma)/2}}\Big](b), \tag{40}$$
and similarly
$$\Im\,\hat\Psi_{\gamma,v}(b) = \frac{\hat\Psi_{\gamma,v}(b) - \overline{\hat\Psi_{\gamma,v}(b)}}{2i} = \frac{1}{2i}\int_0^\infty\Big(\frac{\hat\eta_v(\omega)}{\omega^{(1+\gamma)/2}}\,e^{-ib\omega} - \frac{\hat\eta_v(-\omega)}{\omega^{(1+\gamma)/2}}\,e^{ib\omega}\Big)\,d\omega = \mathcal F\Big[\frac{\hat\eta_v(\omega)\,\mathrm{sign}(\omega)}{2i\,|\omega|^{(1+\gamma)/2}}\Big](b). \tag{41}$$

7. We could also conclude this by Carleson's theorem.

Plugging (40) and (41) into (38), by Parseval's identity (implicitly using that $\Psi_{\gamma,v} \in L^2(\mathbb{R})$), we obtain
$$\int_{S^{d-1}}\int_{\mathbb{R}}\Big[\mathbb{E}\,\psi_\gamma(\langle X, v\rangle - b) - \mathbb{E}\,\psi_\gamma(\langle Y, v\rangle - b)\Big]^2\,db\,d\sigma(v) = \frac{\pi C_{\psi_\gamma}^2}{2}\int_{S^{d-1}}\int_{\mathbb{R}}\frac{|\hat\eta_v(\omega)|^2}{|\omega|^{1+\gamma}}\,d\omega\,d\sigma(v) = \pi C_{\psi_\gamma}^2\int_{\mathbb{R}^d}\frac{|\mathcal F[\mu - \nu](\omega)|^2}{\|\omega\|^{d+\gamma}}\,d\omega,$$
where the last step uses $\hat\eta_v(\omega) = \mathcal F[\mu-\nu](\omega v)$ and a polar change of variables. The result follows after comparing with Proposition 5.

Appendix D. Proof of Proposition 9

Let $\mathcal S(\mathbb{R}^d)$ be the Schwartz space and $\mathcal S'(\mathbb{R}^d)$ be the space of all tempered distributions on $\mathbb{R}^d$. Let $\tau = \mu - \nu$ and $s = (d+\gamma)/2$. First, note that
$$\int\frac{K_s(x)}{(1 + \|x\|^2)^d}\,dx < \infty,$$
so by (Landkof, 1972, Theorem 0.10) we have $K_s \in \mathcal S'(\mathbb{R}^d)$. By (Landkof, 1972, Theorem 0.12), since $K_s \in \mathcal S'(\mathbb{R}^d)$ and $\tau$ has compact support,
$$\widehat{I_s\tau} = \widehat{K_s * \tau} = \hat K_s\,\hat\tau.$$
By Plancherel's identity,
$$(2\pi)^{d/2}\,\|I_s\tau\|_2 = \big\|\widehat{I_s\tau}\big\|_2 = \|\hat K_s\,\hat\tau\|_2 = \frac{1}{\sqrt{F_\gamma(d)}}\,\mathcal E_\gamma(\mu, \nu),$$
where the last equality follows from Proposition 5.
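The generalized energy distance that appears throughout also admits a simple plug-in estimator from two samples. A minimal sketch (the function name is ours; we compute the Székely-Rizzo form $2\,\mathbb{E}\|X-Y\|^\gamma - \mathbb{E}\|X-X'\|^\gamma - \mathbb{E}\|Y-Y'\|^\gamma$, which matches $\mathcal E_\gamma^2$ only up to the paper's normalizing constant, an assumption of this sketch):

```python
import math
import random

def energy_distance(xs, ys, gamma=1.0):
    """V-statistic estimate of the (squared) generalized energy distance
    between two samples in R^d:
        2 E|X-Y|^gamma - E|X-X'|^gamma - E|Y-Y'|^gamma,
    nonnegative for gamma in (0, 2), zero iff the empirical measures agree.
    Normalization differs from the paper's E_gamma by a constant factor."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b))) ** gamma
    cross = sum(dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))
    within_x = sum(dist(a, b) for a in xs for b in xs) / len(xs) ** 2
    within_y = sum(dist(a, b) for a in ys for b in ys) / len(ys) ** 2
    return 2.0 * cross - within_x - within_y
```

Unlike the halfspace supremum, this quantity is an explicit double sum over pairwise distances, hence both cheap to evaluate and differentiable in the sample points, which is what makes it attractive for fitting generative models.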
Appendix E. Proof of Theorem 12 and Proposition 13

In this section we prove both Theorem 12 and Proposition 13. To do so, we give two constructions. The first one, presented in Section E.1, only applies in one dimension and gives optimal results. The second construction, given in Section E.2, applies in all dimensions but loses a polylogarithmic factor.

Notation: Abusing notation, in what follows we write $\mathcal E_\gamma(f, g)$ and $d_H(f, g)$ even when $f$ and $g$ are not necessarily probability measures or probability densities. We will also write $\|f\|_{t,2} = \big\|\,\|\omega\|^t\hat f\,\big\|_2$ for potentially negative exponents $t \in \mathbb{R}$. Note that $\mathcal E_\gamma(f, 0) = \sqrt{F_\gamma(d)}\,\|f\|_{-\frac{d+\gamma}{2},2}$.

E.1 The Case d = 1

The lemma below constructs a difference of two densities with favorable properties.

Lemma 36. Let $f(x) = 1\{|x| \le \pi\}\sin(rx)$ with $r \in \mathbb{Z}$, and write $f_\beta = f * \dots * f$ for $f$ convolved with itself $\beta - 1$ times, i.e. $f_1 = f$, $f_2 = f * f$, and so on. Fix an integer $\beta \ge 1$ and let $|t| < \beta$. We have
$$\|f_\beta\|_{t,2} \asymp r^t, \qquad \|f_\beta\|_1 \asymp 1, \qquad d_H(f_\beta, 0) \asymp \frac1r \qquad\text{as } r\to\infty, \tag{42}$$
where the constants may depend on $\beta, t$.

Proof. The intuition for the estimates (42) is simple: most of the energy of $f$ (and hence $f_\beta$) is at frequencies around $|\omega| \approx r$, and thus differentiating $t$ times boosts the $L^2$-energy by $r^t$. A simple computation shows
$$\hat f(\omega) = c\,(-1)^r\,\frac{i\,r}{\omega^2 - r^2}\,\sin(\omega\pi).$$
Note that $\hat f_\beta = \hat f^{\,\beta}$ and that, because $r \in \mathbb{Z}$, the apparent singularities of $\hat f$ at $\omega = \pm r$ are removable, so $\hat f$ is bounded.

Estimating $\|f_\beta\|_{t,2}$. By definition we have
$$\|f_\beta\|_{t,2}^2 = 2\int_0^\infty |\hat f(\omega)|^{2\beta}\,\omega^{2t}\,d\omega \asymp r^{2\beta}\int_0^\infty\frac{\omega^{2t}\,\sin^{2\beta}(\omega\pi)}{(\omega^2 - r^2)^{2\beta}}\,d\omega.$$
We decompose the integral into three regimes:
1. $\omega < r/2$: here $|\omega^2 - r^2| \gtrsim r^2$ and thus
$$r^{2\beta}\int_0^{r/2}(\dots)\,d\omega \lesssim r^{-2\beta}\int_0^{r/2}\omega^{2t}\sin^{2\beta}(\omega\pi)\,d\omega \lesssim r^{2t}$$
by Lemma 32.
2. $\omega > 3r/2$: here $|\omega^2 - r^2| \gtrsim \omega^2$ and thus
$$r^{2\beta}\int_{3r/2}^\infty(\dots)\,d\omega \lesssim r^{2\beta}\int_{3r/2}^\infty\frac{\sin^{2\beta}(\omega\pi)\,\omega^{2t}}{\omega^{4\beta}}\,d\omega \lesssim r^{2\beta}\,r^{2t-4\beta+1} = r^{1+2t-2\beta} \lesssim r^{2t}.$$
3. $\omega \in [r/2, 3r/2]$: here $|\omega^2 - r^2| \asymp |y|\,r$, where $y = \omega - r$. Note also $\sin(\omega\pi) = \sin(r\pi + y\pi) = (-1)^r\sin(y\pi)$, and $\omega \asymp r$. Thus
$$r^{2\beta}\int_{r/2}^{3r/2}(\dots)\,d\omega = r^{2\beta}\int_{-r/2}^{r/2}(\dots)\,dy \lesssim r^{2\beta}\int_{-r/2}^{r/2}\frac{\sin^{2\beta}(y\pi)\,r^{2t}}{(yr)^{2\beta}}\,dy \lesssim r^{2t}\int_{-\infty}^{\infty}\frac{\sin^{2\beta}(y\pi)}{y^{2\beta}}\,dy \asymp r^{2t},$$
where the last inequality follows because the integrand is bounded at $0$ and decays as $y^{-2\beta} \le y^{-2}$ at infinity.

Estimating $\|f_\beta\|_1$.
Follows from $\|f_\beta\|_1 \ge \|f_\beta\|_2^2/\|f_\beta\|_\infty \gtrsim 1$ by the Cauchy-Schwarz inequality together with $\|f_\beta\|_\infty \le c\,\|\hat f_\beta\|_1 \lesssim 1$ by the Hausdorff-Young inequality, while $\|f_\beta\|_1 \le \|f\|_1^\beta \lesssim 1$ by Young's inequality.

Estimating $d_H$. We get $d_H(f_\beta, 0) \gtrsim \mathcal E_1(f_\beta, 0) \asymp \|f_\beta\|_{-1,2} \asymp \frac1r$ from the first estimate. For the upper bound, note that $\widehat{\mathrm{sign}}(\omega) = \frac{2}{i\omega}$ and $d_H(f_\beta, 0) = \sup_b\frac12\big|\int f_\beta(x)\,\mathrm{sign}(x - b)\,dx\big|$, so by Plancherel's identity,
$$d_H(f_\beta, 0) \lesssim \sup_b\Big|\int\hat f_\beta(\omega)\,\frac{e^{ib\omega}}{\omega}\,d\omega\Big| \lesssim \int\frac{r^\beta\,|\sin(\omega\pi)|^\beta}{|\omega^2 - r^2|^\beta\,|\omega|}\,d\omega.$$
The fact that the above is $O(1/r)$ follows analogously to the proof of our bound on $\|f_\beta\|_{t,2}$, so we omit it. This concludes our proof.

Proof [Proof of Theorem 12 and Proposition 13 for $d = 1$]. We now turn to showing tightness in one dimension, utilizing the density difference constructed in Lemma 36. Given a value of the smoothness $\beta > 0$, set $\bar\beta = \lceil\beta\rceil + 1$ and let $f_{\bar\beta}$ be as in Lemma 36 with $r \asymp \epsilon^{-1/\beta}$ for some $\epsilon \in (0, 1)$. Let $p_0$ be a smooth, compactly supported density with $\inf_{x\in[-\pi,\pi]}p_0(x) > 0$. Define
$$p_\epsilon(x) = p_0(x) + \tfrac{\epsilon}{2}\,f_{\bar\beta}(\bar\beta x) \qquad\text{and}\qquad q_\epsilon(x) = p_0(x) - \tfrac{\epsilon}{2}\,f_{\bar\beta}(\bar\beta x).$$
Clearly both $p_\epsilon, q_\epsilon$ are compactly supported probability densities for sufficiently small $\epsilon$, since $\|f_{\bar\beta}\|_\infty < \infty$ and $f_{\bar\beta}$ is supported on $[-\bar\beta\pi, \bar\beta\pi]$. By Lemma 36, for each $\gamma \in (0, 2)$ the two densities satisfy
$$\|p_\epsilon - q_\epsilon\|_1 \asymp \epsilon, \qquad \|p_\epsilon\|_{\beta,2} \vee \|q_\epsilon\|_{\beta,2} \lesssim 1, \qquad \mathcal E_\gamma(p_\epsilon, q_\epsilon) \asymp \|p_\epsilon - q_\epsilon\|_{-\frac{1+\gamma}{2},2} \asymp \epsilon^{1+\frac{1+\gamma}{2\beta}}, \qquad d_H(p_\epsilon, q_\epsilon) \asymp \epsilon^{1+\frac1\beta}.$$
This proves both Theorem 12 and Proposition 13 for $d = 1$.

E.2 The Case d > 1

We move on to the case of general dimension. In Section E.2.1 we outline our approach; then, in Section E.2.2, we give full details of the construction, following the argument outlined there.

E.2.1 OVERVIEW

For the discussion below, we assume that the ambient dimension $d \ge 2$. Our construction here is less straightforward than for $d = 1$ in Section E.1, but shares the same basic idea. Recall that the basic premise is that we want to saturate Hölder's inequality in eq. (22), which requires the density difference $f = \mu - \nu$ to have Fourier transform (almost) supported on a sphere. For $d = 1$ we took $f$ to be a pure sinusoid.
However, such an $f$ is of course not compactly supported, which is why we multiplied the sinusoid by a rectangle (and then convolved it with itself several times to gain smoothness); the rectangle served as a mollifier. For $d > 1$, let us attempt to follow the same strategy and take
$$f_r(x) = g_r(x)\,h(x),$$
where $r > 0$ is a parameter, $h$ is some compactly supported smooth mollifier, and $g_r(x)$ is defined implicitly via
$$\hat g_r(\omega) = r^{(1-d)/2}\,\delta(\|\omega\| - r),$$
where here and below we denote, a bit informally, by $\delta(\|\cdot\| - r)$ a distribution that integrates any smooth compactly supported function $\varphi$ as follows:
$$\int_{\mathbb{R}^d}\varphi(\omega)\,\delta(\|\omega\| - r)\,d\omega := r^{d-1}\int_{S^{d-1}}\varphi(r\omega)\,d\sigma(\omega) = \frac{2\pi^{d/2}\,r^{d-1}}{\Gamma(\frac d2)}\,\mathbb{E}[\varphi(rX)],$$
where $\sigma$ is the unnormalized surface measure of $S^{d-1}$ and $X$ is a random vector uniformly distributed on $S^{d-1}$. Explicit computation shows
$$g_r(x) = \mathcal F^{-1}[\hat g_r](x) = \frac{r^{(d-1)/2}}{(2\pi)^d}\int_{S^{d-1}} e^{ir\langle\omega, x\rangle}\,d\sigma(\omega) = \frac{r^{(d-1)/2}}{(2\pi)^{d/2}}\,(r\|x\|)^{1-d/2}\,J_{d/2-1}(r\|x\|),$$
where $J_\nu$ denotes the Bessel function of the first kind of order $\nu$. Notice that $g_r$ is spherically symmetric and real-valued (some further properties of it are collected below in Lemma 34). Note that $|g_r(x)| = O(1)$ as $r\to\infty$ for any fixed $x \neq 0$ (Lemma 34), while at the origin we have $|g_r(0)| = \Omega(r^{(d-1)/2})$, which follows from the series expansion of the Bessel function given, for example, in (Watson, 1980, Sections 3.1-3.11). This causes an issue for $d > 1$: as $r\to\infty$, $g_r$ is too large at the origin compared to its tails, which makes it difficult to use as the difference between two probability densities. Hence, we choose our mollifier $h$ to be supported on an annulus instead of on a ball. In addition, it will also be convenient for it to have a super-polynomially decaying Fourier transform, i.e.
$$|\hat h(w)| \le H(\|w\|) := C\exp\Big(-\frac{c\,\|w\|}{\log(\|w\|+2)^2}\Big).$$
The existence of the desired function $h$ is proven in Lemma 37. Note that all of the Fourier energy of $g_r$ lies at frequencies $\|\omega\| = r$ by construction. However, after multiplying by $h$, the energy spills over to adjacent frequencies as well, and we need to estimate the amount of the spill.
Due to the fast decay of $\hat h$ we will show, roughly, the following estimates on the behavior of $\hat f_r$:
$$|\hat f_r(\omega)| \lesssim \tilde O\big(r^{(1-d)/2}\big)\,1\big\{\big|\|\omega\| - r\big| \le \log^2(r)\big\} + r^{(d-1)/2}\,H\big(\max\big(\big|\|\omega\| - r\big|,\ \log^2 r\big)\big)$$
and
$$|\hat f_r(\omega)| \lesssim r^{(d-1)/2}\,\|\omega\|$$
as $r\to\infty$. Note that the first bound above is super-polynomially decaying in both $\big|\|\omega\| - r\big|$ and $r$, which allows us to show that $\|f_r\|_{t,2} = \tilde O(r^t)$ for $t > -\frac{d+2}{2}$, recalling the notation $\|f\|_{t,2} = \big\|\,\|\omega\|^t\hat f\,\big\|_2$. A direct calculation will also show $\|f_r\|_1 \asymp \|f_r\|_2 \asymp \|f_r\|_\infty \asymp 1$. For a desired total-variation separation $\epsilon$, we will set $\mu - \nu = \epsilon f_r$ and choose $r \asymp \epsilon^{-1/\beta}$ to ensure that $\epsilon\,\|f_r\|_{\beta,2} = \tilde O(1)$. For the energy distance between $\mu$ and $\nu$, these choices yield
$$\mathcal E_\gamma(\mu, \nu) \asymp \epsilon\,\|f_{\epsilon^{-1/\beta}}\|_{-\frac{d+\gamma}{2},2} = \tilde O\big(\epsilon^{1+\frac{d+\gamma}{2\beta}}\big) = \tilde O\big(\mathrm{TV}^{1+\frac{d+\gamma}{2\beta}}\big),$$
as required. We now proceed to the rigorous details.

E.2.2 THE CONSTRUCTION

First, we must construct the mollifier $h$ with the properties outlined in Section E.2.1. Recall that a function $f$ is radial (also known as spherically symmetric) if its value at $x \in \mathbb{R}^d$ depends only on $\|x\|$; in other words, $f(x) = f(y)$ holds for all $x, y \in \mathbb{R}^d$ with $\|x\| = \|y\|$.

Lemma 37. There exist a compactly supported radial Schwartz function $h$ and a positive sequence $\{r_n\}_{n=1}^\infty$ satisfying $r_n = \Theta(n)$, such that
$$\mathrm{supp}(h) \subseteq B(0, 1), \tag{43}$$
$$\mathrm{supp}(h) \subseteq \mathbb{R}^d\setminus B(0, r_0), \tag{44}$$
$$|\hat h(w)| \le C\exp\Big(-\frac{c\,\|w\|}{\log(\|w\|+2)^2}\Big)\ \text{ for all } w\in\mathbb{R}^d,\ \text{and} \tag{45}$$
$$\hat h(r_n u) = 0\ \text{ for all } u\in S^{d-1}, \tag{46}$$
for universal constants $C, c, r_0 > 0$.

Proof. First, let $h_0$ be as constructed in Lemma 35, which already satisfies (43) and (45). To address the other two requirements, we modify $h_0$ by convolving it with two additional factors:
$$h(x) := \big(A_0(\cdot) * h_0(8\,\cdot) * \rho_0(\cdot)\big)(x),$$
where $A_0$ and $\rho_0$ aim to address (44) and (46), respectively, and are defined as
$$A_0(x) = \exp\Big(-\frac{1}{1/64 - (\|x\| - 1/2)^2}\Big)\,1\big\{\|x\|\in(3/8, 5/8)\big\}, \qquad \rho_0(x) = 1\{\|x\| < 1/8\}.$$
Before proceeding, note that $h$ is clearly a radial Schwartz function. Let us now verify that $h$ indeed satisfies the four requirements. Note that $A_0$ is supported on the annulus $B(0, 5/8)\setminus B(0, 3/8)$, and both $h_0(8\,\cdot)$ and $\rho_0$ are supported on $B(0, 1/8)$.
Therefore, $\mathrm{supp}(h) \subseteq B(0, 7/8)\setminus B(0, 1/8)$, which implies (43) and (44). We now turn to the other two conditions, in Fourier space. Note that $\hat h(w) = (1/8)^d\,\hat A_0(w)\,\hat h_0(w/8)\,\hat\rho_0(w)$. From Item 3 of Lemma 34 we know that
$$\mathcal F\big[1\{\|\cdot\| < 1\}\big](w) = \frac{(2\pi)^{d/2}}{\|w\|^{d/2}}\,J_{d/2}(\|w\|).$$
Hence, by Item 2 of Lemma 34, the function $\hat\rho_0(w) = (1/8)^d\,\mathcal F[1\{\|\cdot\| < 1\}](w/8)$ has infinitely many zeros, located near the values $\|w\| = 8\big(2n\pi + \frac{(d+1)\pi}{4}\big)$ for sufficiently large $n \in \mathbb{Z}_+$, which implies (46). Finally, for (45), note that since both $A_0$ and $\rho_0$ are integrable, their Fourier transforms $\hat A_0$ and $\hat\rho_0$ are bounded, so that
$$|\hat h(w)| \le (1/8)^d\,\|\hat A_0\|_\infty\,\|\hat\rho_0\|_\infty\,|\hat h_0(w/8)| \lesssim |\hat h_0(w/8)|,$$
concluding the proof.

Let $h$ be as constructed in Lemma 37, and define
$$f_r = g_r\,h \tag{47}$$
for $r > 0$ and $\hat g_r(\omega) = r^{(1-d)/2}\,\delta(\|\omega\| - r)$. Recall from the overview of our construction in Section E.2.1 that $f_r$ is our proposed density difference, which we claim (approximately) saturates Hölder's inequality in (22). The next lemma records the properties of $f_r$ that will enable us to complete our proof.

Lemma 38. Let $f_r$ be as in (47) and let $\{r_n\}_{n=1}^\infty$ be the sequence constructed in Lemma 37. The following hold.
(i) For all $n \in \mathbb{N}$ we have $\int_{\mathbb{R}^d} f_{r_n}(x)\,dx = 0$ and $\mathrm{supp}(f_{r_n}) \subseteq B(0, 1)$.
(ii) We have $\|f_{r_n}\|_\infty \asymp \|f_{r_n}\|_2 \asymp \|f_{r_n}\|_1 \asymp 1$, hiding constants independent of $n$.
(iii) For any $t > -\frac{d+2}{2}$ we have $\|f_{r_n}\|_{t,2} = O\big(r_n^t\,\log^{2d}(r_n)\big)$ as $n\to\infty$, hiding constants independent of $n$.
(iv) Recall the definition of $\psi_\gamma$ from (17). For any $\gamma\in(0,2)$ we have
$$\sup_{v\in S^{d-1},\,b\in\mathbb{R}}\Big|\int_{\mathbb{R}^d}\psi_\gamma(\langle x, v\rangle - b)\,f_{r_n}(x)\,dx\Big| = O\big(r_n^{-(d+\gamma)/2}\,\log^{2d}(r_n)\big),$$
hiding constants independent of $n$.

Proof. Let us drop the subscript of $r_n$ to simplify notation.

Showing (i). Note that $\int_{\mathbb{R}^d}f_r(x)\,dx = \hat f_r(0)$. Then $\hat f_r(0) = 0$ follows from the construction of $h$ and $g_r$: indeed, $\hat g_r$ is supported on $r\,S^{d-1}$ while $\hat h|_{r\,S^{d-1}} \equiv 0$. The fact that $\mathrm{supp}(f_r) \subseteq B(0, 1)$ follows from $\mathrm{supp}(h) \subseteq B(0, 1)$.

Showing (ii). Since $f_r$ has compact support, we immediately have $\|f_r\|_1 \lesssim \|f_r\|_2 \lesssim \|f_r\|_\infty$.
As $h$ is continuous and supported on the annulus $\{x : r_0 \le \|x\| \le 1\}$ by construction, it suffices to bound $g_r$ on said annulus. Now, for any $x$ with $r_0 \le \|x\| \le 1$, we have by Lemma 34 that
$$|g_r(x)| \lesssim r^{(d-1)/2}\,(r\|x\|)^{1-d/2}\,\frac{1}{\sqrt{r\|x\|}} \lesssim \|x\|^{(1-d)/2} \lesssim 1,$$
which shows that $\|f_r\|_\infty \lesssim 1$. We now turn to lower bounding $\|f_r\|_1$. Recall that $h$ is uniformly continuous and nontrivial, hence $\int_{S^{d-1}}|h(uv)|\,d\sigma(v) \neq 0$ for some radius $u$, and thus for all $u\in(u_0, u_1) \subseteq (0, 1)$ for some constants $u_0, u_1$. Using that $g_r$ is spherically symmetric, we compute
$$\|f_r\|_1 = \int_{\mathbb{R}^d}|g_r(x)|\,|h(x)|\,dx = \int_0^\infty u^{d-1}\,\big|g_r(u, 0, \dots, 0)\big|\int_{S^{d-1}}|h(uv)|\,d\sigma(v)\,du \gtrsim \int_{u_0}^{u_1} u^{d-1}\,r^{(d-1)/2}\,(ru)^{1-d/2}\,\big|J_{d/2-1}(ru)\big|\,du \gtrsim 1,$$
where the last step follows by (35) once again.

Showing (iii). Let $0 < s < r$, whose precise value will be set later. For convenience, set $B_s = \{x\in\mathbb{R}^d : \|x\| \le s\}$ and $B_s^c = \mathbb{R}^d\setminus B_s$. Recall that by definition
$$\hat f_r(\omega) = r^{(1-d)/2}\int_{\mathbb{R}^d}\hat h(\omega + x)\,\delta(\|x\| - r)\,dx = \underbrace{r^{(1-d)/2}\int_{\mathbb{R}^d}(\hat h\,1_{B_s})(\omega + x)\,\delta(\|x\| - r)\,dx}_{I} + \underbrace{r^{(1-d)/2}\int_{\mathbb{R}^d}(\hat h\,1_{B_s^c})(\omega + x)\,\delta(\|x\| - r)\,dx}_{II}. \tag{48}$$
Let $C, c$ be as in Lemma 37, and $H(x) = C\exp(-cx/\log^2(x+2))$. Note that $\|\hat h\|_\infty \le C$. Therefore, the first term in the decomposition (48) can be bounded by
$$|I| \le C\,r^{(1-d)/2}\int_{\mathbb{R}^d}1\{\|\omega + x\| \le s\}\,\delta(\|x\| - r)\,dx = C\,r^{(1-d)/2}\,1\big\{\|\omega\|\in[r-s, r+s]\big\}\int_{\mathbb{R}^d}1\{\|\omega+x\|\le s\}\,\delta(\|x\|-r)\,dx \lesssim r^{(1-d)/2}\,s^{d-1}\,1\big\{\|\omega\|\in[r-s, r+s]\big\}.$$
The second step uses that if $\|\omega\| \notin [r-s, r+s]$ then the integral vanishes. The third step uses the fact that the surface area of the intersection of $B_s$ with any sphere of any radius (and, in particular, the one centered at $\omega$ with radius $r$) is at most $O(s^{d-1})$. Moving on to the second term, we have
$$|II| = r^{(d-1)/2}\,\Big|\int_{S^{d-1}}(\hat h\,1_{B_s^c})(\omega + ru)\,d\sigma(u)\Big| \lesssim r^{(d-1)/2}\,H\big(\max\{|\|\omega\| - r|,\ s\}\big),$$
using that $H : [0, \infty)\to(0, C]$ is decreasing and that $|\hat h(y)\,1_{B_s^c}(y)| \le H(\max\{\|y\|, s\})$ for all $y\in\mathbb{R}^d$. Summarizing, we have the pointwise estimate
$$|\hat f_r(\omega)| \lesssim r^{(1-d)/2}\,s^{d-1}\,1\big\{\|\omega\|\in[r-s, r+s]\big\} + r^{(d-1)/2}\,H\big(\max\{|\|\omega\| - r|,\ s\}\big) \tag{49}$$
for all $\omega\in\mathbb{R}^d$ and $0 < s < r$. We next show that $\hat f_r$ vanishes linearly at the origin. Recall from the construction of $h$ (Lemma 37) that $\hat h|_{r_n S^{d-1}} \equiv 0$.
Then, we observe that for any $\omega\in\mathbb{R}^d$,
$$|\hat f_r(\omega)| = r^{(d-1)/2}\,\Big|\int_{S^{d-1}}\hat h(\omega + ru)\,d\sigma(u)\Big| = r^{(d-1)/2}\,\Big|\int_{S^{d-1}}\big\{\hat h(\omega + ru) - \hat h(ru)\big\}\,d\sigma(u)\Big| \le r^{(d-1)/2}\,\|\hat h\|_{\mathrm{Lip}}\,\frac{2\pi^{d/2}}{\Gamma(d/2)}\,\|\omega\| \lesssim r^{(d-1)/2}\,\|\omega\|, \tag{50}$$
where we use that $\hat h$ is Schwartz by construction, and thus has finite Lipschitz constant $\|\hat h\|_{\mathrm{Lip}}$.

With (49) and (50) in hand, we can proceed to bounding the norm of $f_r$. Let $s = D\log^2(r)$ for a large constant $D$ independent of $r$, and assume that $r$ is large enough so that $s < r/2$. Also fix $\theta > 0$, whose precise value is specified later. We have
$$\|f_r\|_{t,2}^2 = \int_{\mathbb{R}^d}\|\omega\|^{2t}\,|\hat f_r(\omega)|^2\,d\omega \overset{(50)}{\lesssim} r^{d-1}\int_{\|\omega\|\le r^{-\theta}}\|\omega\|^{2t+2}\,d\omega + \int_{\|\omega\|>r^{-\theta}}\|\omega\|^{2t}\,|\hat f_r(\omega)|^2\,d\omega$$
$$\overset{(49)}{\lesssim} r^{d-1-\theta(2t+d+2)} + r^{1-d}\,s^{2(d-1)}\int_{\|\omega\|>r^{-\theta}}\|\omega\|^{2t}\,1\big\{\|\omega\|\in[r-s, r+s]\big\}\,d\omega + \int_{\|\omega\|>r^{-\theta}}\|\omega\|^{2t}\,r^{d-1}\,H^2\big(\max\{|\|\omega\| - r|,\ s\}\big)\,d\omega$$
$$\lesssim r^{d-1-\theta(2t+d+2)} + r^{2t}\,s^{2d-1} + r^{d-1}\int_{r^{-\theta}}^\infty u^{2t+d-1}\,H^2\big(\max\{|u - r|,\ s\}\big)\,du.$$
Note that in the derivation above we changed to polar coordinates freely, and that in the second inequality we used the assumption $t > -\frac{d+2}{2}$. Setting $\theta$ to any positive value greater than $(d - 1 - 2t)/(2t + d + 2)$ ensures that the first term in the final line is $O(r^{2t})$, while $s^{2d-1} = O(\log^{4d-2}(r))$. As for the integral term, we can bound it by
$$r^{d-1}\,H^2(s)\int_{r^{-\theta}}^{2r}u^{2t+d-1}\,du + r^{d-1}\int_{2r}^\infty u^{2t+d-1}\,H^2(u/2)\,du \overset{\text{Lemma 33}}{\lesssim} \mathrm{poly}(r)\,H^2(s) + r^{2t}.$$
By taking $D$ large enough (independently of $r$) in the definition of $s = D\log^2(r)$, we can also make the first term $\mathrm{poly}(r)\,H^2(s)$ less than $O(r^{2t})$, which concludes the proof of (iii).

Showing (iv). The bounds that we develop below are analogous to those given in the proof of (iii). Fix $b\in\mathbb{R}$ and $v\in S^{d-1}$ and define
$$\Delta := \int_{\mathbb{R}^d}\psi_\gamma(\langle v, x\rangle - b)\,f_r(x)\,dx.$$
Suppose first that $\gamma \neq 1$. Then, using Lemmas 28 and 29, we know by dominated convergence that
$$\Delta = C_{\psi_\gamma}\lim_{\epsilon\to0}\int_{\mathbb{R}^d}\int_{A_\epsilon}\frac{\cos\big(t(\langle v, x\rangle - b)\big) - 1\{\gamma > 1\}}{t^{(1+\gamma)/2}}\,f_r(x)\,dt\,dx = C_{\psi_\gamma}\lim_{\epsilon\to0}\,\Re\int_{A_\epsilon}\int_{\mathbb{R}^d}\frac{e^{it(\langle v, x\rangle - b)}\,f_r(x)}{t^{(1+\gamma)/2}}\,dx\,dt = C_{\psi_\gamma}\lim_{\epsilon\to0}\int_{A_\epsilon}\frac{\cos(tb)\,\hat f_r(tv)}{t^{(1+\gamma)/2}}\,dt,$$
where the $1\{\gamma>1\}$ term vanishes because $\int f_r = 0$, and the last step uses that $\hat f_r$ is real-valued and radial. Similarly, for $\gamma = 1$ we can compute
$$\Delta = C_{\psi_1}\lim_{\epsilon\to0}\int_{A_\epsilon}\frac{\sin(-tb)\,\hat f_r(tv)}{t}\,dt.$$
In either case, we have $|\Delta| \lesssim \int_0^\infty|\hat f_r(tv)|\,t^{-(1+\gamma)/2}\,dt$. Let $s = D\log^2(r)$ for large $D$ independent of $r$, as in the proof of (iii), and let $\theta > 0$, whose precise value is specified later.
Assuming that $r$ is large enough so that $s < r/2$, for any $\gamma\in(0,2)$ we have
$$\int_0^\infty\frac{|\hat f_r(tv)|}{t^{(1+\gamma)/2}}\,dt = \int_0^{r^{-\theta}}\frac{|\hat f_r(tv)|}{t^{(1+\gamma)/2}}\,dt + \int_{r^{-\theta}}^\infty\frac{|\hat f_r(tv)|}{t^{(1+\gamma)/2}}\,dt$$
$$\overset{(50)}{\lesssim} r^{(d-1)/2}\int_0^{r^{-\theta}}t^{(1-\gamma)/2}\,dt + \int_{r^{-\theta}}^\infty\frac{1}{t^{(1+\gamma)/2}}\Big(r^{(1-d)/2}\,s^{d-1}\,1\{t\in[r-s, r+s]\} + r^{(d-1)/2}\,H\big(\max\{|t - r|,\ s\}\big)\Big)\,dt$$
$$\lesssim r^{\frac{d-1}{2}-\theta\frac{3-\gamma}{2}} + r^{-(d+\gamma)/2}\,\log^{2d}(r) + H(s)\,r^{(d-1)/2}\int_{r^{-\theta}}^{2r}\frac{dt}{t^{(1+\gamma)/2}} + r^{(d-1)/2}\int_{2r}^\infty\frac{H(t/2)}{t^{(1+\gamma)/2}}\,dt \overset{\text{Lemma 33}}{\lesssim} r^{\frac{d-1}{2}-\theta\frac{3-\gamma}{2}} + r^{-(d+\gamma)/2}\,\log^{2d}(r) + H(s)\,\mathrm{poly}(r) + r^{-100d}. \tag{51}$$
Set $\theta$ to any value greater than $(2d + \gamma - 1)/(3 - \gamma)$, which ensures that the first term in (51) is $O(r^{-(d+\gamma)/2})$. By taking $D$ large enough in the definition of $s = D\log^2(r)$, we can make $H(s)$ smaller than any polynomial in $r$, which ensures that the third term in (51) is also $O(r^{-(d+\gamma)/2})$. We thus obtain the final bound $|\Delta| \lesssim r^{-(d+\gamma)/2}\log^{2d}(r)$, concluding our proof.

Proof [Proof of Theorem 12 and Proposition 13 for $d > 1$]. Using the functions $\{f_{r_n}\}_{n=1}^\infty$ constructed in Lemma 38, we are ready to prove Theorem 12 and Proposition 13 for $d > 1$. Let $p_0$ be a compactly supported probability density with $\inf_{\|x\|\le1}p_0(x) > 0$. Fix the smoothness $\beta > 0$. Given any desired total variation separation $\epsilon\in(0,1)$, we can find $n_0\in\mathbb{N}$ such that $\epsilon^{-1/\beta} \asymp r_{n_0}$, where we hide an $\epsilon$-independent multiplicative constant. Define
$$p_\epsilon = p_0 + \tfrac{\epsilon}{2}\,f_{r_{n_0}} \qquad\text{and}\qquad q_\epsilon = p_0 - \tfrac{\epsilon}{2}\,f_{r_{n_0}}.$$
Clearly $p_\epsilon$ and $q_\epsilon$ are compactly supported probability densities for all small enough $\epsilon$. Moreover, by Lemma 38 they satisfy $\|p_\epsilon - q_\epsilon\|_1 \asymp \epsilon$ and $\|p_\epsilon\|_{\beta,2}\vee\|q_\epsilon\|_{\beta,2} \lesssim 1$ and
$$\mathcal E_\gamma(p_\epsilon, q_\epsilon) \lesssim \epsilon^{1+\frac{d+\gamma}{2\beta}}\,\log(1/\epsilon)^{2d} \qquad\text{and}\qquad d_H(p_\epsilon, q_\epsilon) \lesssim \epsilon^{1+\frac{d+1}{2\beta}}\,\log(1/\epsilon)^{2d}$$
for all fixed $\gamma\in(0,2)$. This concludes our proof.

Appendix F. Proof of Proposition 24

Proof. Note that $d_H = T_{d,0}$. Let $p_\epsilon, q_\epsilon$ be the compactly supported densities constructed in the proof of Theorem 12 in the general-dimension case. Then by construction
$$\epsilon \asymp \mathrm{TV}(p_\epsilon, q_\epsilon) \asymp \|p_\epsilon - q_\epsilon\|_2, \qquad \|p_\epsilon\|_{\beta,2} + \|q_\epsilon\|_{\beta,2} \lesssim 1, \qquad d_H(p_\epsilon, q_\epsilon) \lesssim \epsilon^{1+\frac{d+1}{2\beta}}\,\log(1/\epsilon)^{2d}.$$
Write $p_{\epsilon,n}$ and $q_{\epsilon,n}$ for the empirical measures of $p_\epsilon$ and $q_\epsilon$ respectively, based on $n$ i.i.d. observations each.
By the triangle inequality we have
\[
\mathbb{E}\, d_H(p_{\epsilon,n}, q_{\epsilon,n}) \le \mathbb{E}\, d_H(p_{\epsilon,n}, p_\epsilon) + d_H(p_\epsilon, q_\epsilon) + \mathbb{E}\, d_H(q_\epsilon, q_{\epsilon,n}) \overset{\text{Lemma 20}}{\lesssim} \frac{1}{\sqrt{n}} + \epsilon^{2\beta} \log(1/\epsilon)^d.
\]
This completes the proof.

References

Ery Arias-Castro, Bruno Pelletier, and Venkatesh Saligrama. Remember the curse of dimensionality: The case of goodness-of-fit testing in arbitrary dimension. Journal of Nonparametric Statistics, 30(2):448–471, 2018.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.

Yu Bai, Tengyu Ma, and Andrej Risteski. Approximability of discriminators implies diversity in GANs. arXiv preprint arXiv:1806.10586, 2018.

Keith Ball. Eigenvalues of Euclidean distance matrices. Journal of Approximation Theory, 68(1):74–82, 1992.

Alexander Barg and G. David Forney. Random codes: Minimum distances and error exponents. IEEE Transactions on Information Theory, 48(9):2568–2573, 2002.

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International Conference on Machine Learning, pages 531–540. PMLR, 2018.

Denis Belomestny, Eric Moulines, Alexey Naumov, Nikita Puchkin, and Sergey Samsonov. Rates of convergence for density estimation with generative adversarial networks. arXiv preprint arXiv:2102.00199, 2021.

Denis Belomestny, Alexey Naumov, Nikita Puchkin, and Sergey Samsonov. Simultaneous approximation of a smooth function and its derivatives by deep neural networks with piecewise-polynomial activations. Neural Networks, 161:242–253, 2023.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Clément L. Canonne. A short note on learning discrete distributions. arXiv preprint arXiv:2002.11457, 2020.

Minwoo Chae. Rates of convergence for nonparametric estimation of singular distributions using generative adversarial networks. arXiv preprint arXiv:2202.02890, 2022.

Minwoo Chae. Wasserstein upper bounds of $L^p$-norms for multivariate densities in Besov spaces. Statistics & Probability Letters, 210:110131, 2024.

Minwoo Chae and Stephen G. Walker. Wasserstein upper bounds of the total variation for smooth densities. Statistics & Probability Letters, 163:108771, 2020. ISSN 0167-7152. doi: https://doi.org/10.1016/j.spl.2020.108771. URL https://www.sciencedirect.com/science/article/pii/S0167715220300742.

Minwoo Chae, Dongha Kim, Yongdai Kim, and Lizhen Lin. A likelihood approach to nonparametric estimation of a singular distribution using deep generative models. Journal of Machine Learning Research, 24(77):1–42, 2023.

Minshuo Chen, Wenjing Liao, Hongyuan Zha, and Tuo Zhao. Distribution approximation and statistical estimation guarantees of generative adversarial networks. arXiv preprint arXiv:2002.03938, 2020.

Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022.

Alex Cohen. Fractal uncertainty in higher dimensions. arXiv preprint arXiv:2305.05022, 2023.

Vincent Divol. Measure estimation on manifolds: an optimal transport approach. Probability Theory and Related Fields, 183(1):581–647, 2022.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Ildar A. Ibragimov and Rafail Z. Khasminskii. Estimation of distribution density. Journal of Soviet Mathematics, 21:40–57, 1983.

Yu. I. Ingster. Minimax testing of nonparametric hypotheses on a distribution density in the $L_p$ metrics. Theory of Probability & Its Applications, 31(2):333–337, 1987. doi: 10.1137/1131042.

Yuri Ingster and Irina A. Suslina. Nonparametric Goodness-of-Fit Testing Under Gaussian Models, volume 169. Springer Science & Business Media, 2003.

Zeyu Jia, Yury Polyanskiy, and Yihong Wu. Entropic characterization of optimal rates for learning Gaussian mixtures. arXiv preprint arXiv:2306.12308, 2023.

Jørn Justesen. Class of constructive asymptotically good algebraic codes. IEEE Transactions on Information Theory, 18(5):652–656, 1972.

Arlene K. H. Kim and Adityanand Guntuboyina. Minimax bounds for estimating multivariate Gaussian location mixtures. Electronic Journal of Statistics, 16(1):1461–1484, 2022.

Diederik P. Kingma and Max Welling. An introduction to variational autoencoders. Foundations and Trends in Machine Learning, 12(4):307–392, 2019.

Szymon Knop, Jacek Tabor, Przemyslaw Spurek, Igor Podolak, Marcin Mazur, and Stanislaw Jastrzebski. Cramer-Wold autoencoder. 2018. doi: 10.48550/ARXIV.1805.09235.

Soheil Kolouri, Kimia Nadjahi, Shahin Shahrampour, and Umut Simsekli. Generalized sliced probability metrics. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4513–4517, 2022. doi: 10.1109/ICASSP43922.2022.9746016.

Naum S. Landkof. Foundations of Modern Potential Theory, volume 180. Springer, 1972.

Tong Li and Ming Yuan. On the optimality of Gaussian kernel based nonparametric tests against smooth alternatives. arXiv preprint arXiv:1909.03302, 2019.

Tengyuan Liang. How well generative adversarial networks learn distributions. The Journal of Machine Learning Research, 22(1):10366–10406, 2021.

Subhash Chandra Malik and Savita Arora. Mathematical Analysis. New Age International, 1992.

Youssef Marzouk, Zhi Ren, Sven Wang, and Jakob Zech. Distribution learning via neural differential equations: a nonparametric statistical perspective. arXiv preprint arXiv:2309.01043, 2023.

Henryk Minc and Leroy Sathre. Some inequalities involving $(r!)^{1/r}$. Proceedings of the Edinburgh Mathematical Society, 14(1):41–46, 1964.

Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.

Kimia Nadjahi, Alain Durmus, Lénaïc Chizat, Soheil Kolouri, Shahin Shahrampour, and Umut Simsekli. Statistical and topological properties of sliced probability divergences. Advances in Neural Information Processing Systems, 33:20802–20812, 2020.

Francis J. Narcowich and Joseph D. Ward. Norm estimates for the inverses of a general class of scattered-data radial-function interpolation matrices. Journal of Approximation Theory, 69(1):84–109, 1992.

Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diffusion models are minimax optimal distribution estimators. arXiv preprint arXiv:2303.01861, 2023.

Seunghoon Paik, Michael Celentano, Alden Green, and Ryan J. Tibshirani. Maximum mean discrepancy meets neural networks: The Radon-Kolmogorov-Smirnov test. arXiv preprint arXiv:2309.02422, 2023.

Yury Polyanskiy and Yihong Wu. Information Theory: From Coding to Learning. Cambridge University Press, 2023+.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2001. ISBN 9780262256933. doi: 10.7551/mitpress/4175.001.0001.

Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, pages 2263–2291, 2013.

Shashank Singh, Ananya Uppal, Boyue Li, Chun-Liang Li, Manzil Zaheer, and Barnabás Póczos. Nonparametric density estimation under adversarial losses. Advances in Neural Information Processing Systems, 31, 2018.

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

Elias M. Stein and Guido Weiss. Introduction to Fourier Analysis on Euclidean Spaces, volume 1. Princeton University Press, 1971.

Arthur Stéphanovitch, Eddie Aamari, and Clément Levrard. Wasserstein GANs are minimax optimal distribution estimators. arXiv preprint arXiv:2311.18613, 2023.

Gábor J. Székely and Maria L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013.

Rong Tang and Yun Yang. Minimax rate of distribution estimation on unknown submanifolds under adversarial losses. The Annals of Statistics, 51(3):1282–1308, 2023.

Stefan Tiegel. Hardness of agnostically learning halfspaces from worst-case lattice problems. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 3029–3064. PMLR, 2023. URL https://proceedings.mlr.press/v195/tiegel23a.html.

Ananya Uppal, Shashank Singh, and Barnabás Póczos. Nonparametric density estimation & convergence rates for GANs under Besov IPM losses. Advances in Neural Information Processing Systems, 32, 2019.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

C. Villani. Topics in Optimal Transportation. Graduate Studies in Mathematics. American Mathematical Society, 2003. ISBN 9780821833124. URL https://books.google.com/books?id=idyFAwAAQBAJ.

Sven Wang and Youssef Marzouk. On minimax density estimation via measure transport. arXiv preprint arXiv:2207.10231, 2022.

George N. Watson. A Treatise on the Theory of Bessel Functions. Cambridge University Press, 1980. ISBN 052106743X.

Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.

Yannis G. Yatracos. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, 13(2):768–774, 1985.

Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):2008–2026, 2018.