Journal of Machine Learning Research 26 (2025) 1-51. Submitted 2/24; Revised 6/25; Published 7/25

Density Estimation Using the Perceptron

Patrik Róbert Gerber PRGERBER@MIT.EDU
Department of Mathematics
Massachusetts Institute of Technology
77 Massachusetts Avenue, Cambridge, MA 02139, USA

Tianze Jiang TZJIANG@PRINCETON.EDU
Operations Research and Financial Engineering
Princeton University
98 Charlton St, Princeton, NJ 08540, USA

Yury Polyanskiy YP@MIT.EDU
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
32 Vassar St, Cambridge, MA 02139, USA

Rui Sun ERUISUN@STANFORD.EDU
Department of Statistics
Stanford University
450 Jane Stanford Way, Stanford, CA 94305, USA

Editor: Aapo Hyvärinen

Abstract

We propose a new density estimation algorithm. Given $n$ i.i.d. observations from a distribution belonging to a class of densities on $\mathbb{R}^d$, our estimator outputs any density in the class whose perceptron discrepancy with the empirical distribution is at most $O(\sqrt{d/n})$. The perceptron discrepancy between two distributions is defined as the largest difference in mass that they place on any halfspace. It is shown that this estimator achieves an expected total variation distance to the truth that is almost minimax optimal over the class of densities with bounded Sobolev norm and over Gaussian mixtures. This suggests that the regularity of the true distribution could be an explanation for the efficiency of the ubiquitous step in machine learning that replaces optimization over large function spaces with optimization over simpler parametric classes (such as the discriminators of GANs). We also show that replacing the perceptron discrepancy with the generalized energy distance of Székely and Rizzo (2013) further improves the total variation loss. The generalized energy distance between empirical distributions is easily computable and differentiable, which makes it especially useful for fitting generative models.
To the best of our knowledge, it is the first simple distance with such properties that yields minimax optimal statistical guarantees. In addition, we shed light on the ubiquitous method of representing discrete data on a domain $[k]$ via embedding vectors on a unit ball in $\mathbb{R}^d$. We show that taking $d \asymp \log(k)$ allows one to use simple linear probing to evaluate and estimate the total variation distance, recovering the minimax optimal sample complexity for the class of discrete distributions on $[k]$.

Keywords: perceptron, halfspaces, GAN, density estimation, one-hot encoding, discrete distributions, total variation

©2025 Patrik Róbert Gerber, Tianze Jiang, Yury Polyanskiy, and Rui Sun. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0261.html.

1. Introduction

A standard step in many machine learning algorithms is to replace an (intractable) optimization over a general function space with an optimization over a large parametric class (most often neural networks). This is done in supervised learning for fitting classifiers, in variational inference (Blei et al., 2017; Zhang et al., 2018) for applying the ELBO, in variational autoencoders (Kingma and Welling, 2019) for fitting the decoder, in Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Arjovsky et al., 2017) for fitting the discriminator, in diffusion models (Song et al., 2020; Chen et al., 2022) for fitting the score function, and in many other settings. To be specific, let us focus on the example of GANs, which brought about the new era of density estimation in high-dimensional spaces. The problem setting is the following. We are given access to i.i.d. data $X_1, \dots, X_n \in \mathbb{R}^d$ sampled from an unknown distribution $\nu$, and a class of distributions $\mathcal{G}$ on $\mathbb{R}^d$ (the class of available "generators").
The goal of the learner is to find $\arg\min_{\nu' \in \mathcal{G}} D(\nu', \nu)$, where $D$ is some dissimilarity measure ("metric") between probability distributions. In the case of GANs this measure is the Jensen-Shannon divergence
$$\mathrm{JS}(p, q) \triangleq \tfrac{1}{2}\mathrm{KL}\Big(p \,\Big\|\, \tfrac{p+q}{2}\Big) + \tfrac{1}{2}\mathrm{KL}\Big(q \,\Big\|\, \tfrac{p+q}{2}\Big),$$
where $\mathrm{KL}(p\|q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$ is the Kullback-Leibler divergence. As any $f$-divergence, JS has a variational form (see (Polyanskiy and Wu, 2023+, Example 7.5)):
$$\mathrm{JS}(p, q) = \log 2 + \sup_{h:\mathbb{R}^d \to (0,1)} \big\{ \mathbb{E}_p[\log h] + \mathbb{E}_q[\log(1-h)] \big\}.$$
With this idea in mind, we can now restate the objective of minimizing $\mathrm{JS}(\nu', \nu)$ as a game between a generator $\nu'$ and a discriminator $h$; i.e., the GAN's estimator is
$$\hat\nu \in \arg\min_{\nu' \in \mathcal{G}}\ \sup_{h:\mathbb{R}^d \to (0,1)}\ \frac{1}{n}\sum_{i=1}^{n} \log h(X_i) + \mathbb{E}_{\nu'}[\log(1-h)], \quad (1)$$
where we also replaced the expectation over (the unknown) $\nu$ with its empirical version $\nu_n \triangleq \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$. Subsequently, the idea was extended to other types of metrics, notably the Wasserstein GAN (Arjovsky et al., 2017), which defines
$$\hat\nu \in \arg\min_{\nu' \in \mathcal{G}}\ \sup_{f \in \mathcal{D}}\ \Big\{ \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}_{Y\sim\nu'}[f(Y)] \Big\}, \quad (2)$$
where the set of discriminators $\mathcal{D}$ is a class of Lipschitz functions (corresponding to the variational characterization of the Wasserstein-1 distance). The final step in turning (1) or (2) into an algorithm is to relax the domain of the inner maximization (the "discriminator") to a parametric class of neural network discriminators $\mathcal{D}$. Note that replacing $\sup_{h:\mathbb{R}^d\to(0,1)}$ with $\sup_{h\in\mathcal{D}}$ effectively changes the objective from minimizing the JS divergence to minimizing a "neural-JS", similar to how MINE (Belghazi et al., 2018) replaces the true mutual information with a neural one. This weakening is quite worrisome for a statistician. While the JS divergence is a strong statistical distance, as it bounds total variation from above and from below (Polyanskiy and Wu, 2023+, Eq. (7.39)), the neural-JS is unlikely to possess any such properties. How does one justify this restriction to a simpler class $\mathcal{D}$?
A practitioner would say that while taking $\max_{h \in \mathcal{D}}$ restricts the power of the discriminator, the design of $\mathcal{D}$ is fine-tuned to pick up those features of the distributions that are relevant to the human eye.¹ A theoretician, instead, would appeal to universal approximation results about neural networks to claim that the restriction to $\mathcal{D}$ is almost lossless. The purpose of this paper is to suggest, and prove, a third explanation: the answer is in the regularity of $\nu$ itself. Indeed, we show that the restriction of discriminators to a very small class $\mathcal{D}$ in (1) results in almost no loss of minimax statistical guarantees, even if $\mathcal{D}$ is far from being a universal approximator. That is, the minimizing distribution $\hat\nu$ selected with respect to a weak form of the distance enjoys almost minimax optimal guarantees with respect to the strong total variation distance, provided that the true distribution $\nu$ is regular enough. Phrased yet another way, even though the neural distance is very coarse and imprecise, and hence the minimizer selected with respect to it might be expected to only fool very naive discriminators, in reality it turns out to fool any arbitrarily complex, but bounded, discriminator.

1. Implying, in other words, that whether or not the total variation $\mathrm{TV}(\hat\nu, \nu)$ is high is irrelevant as long as the generated images look good enough to humans.

Let us proceed to a more formal statement of our results. One may consult Section 1.3 for notation. We primarily focus on two classes of distributions on $\mathbb{R}^d$: first, $\mathcal{P}_S(\beta, d, C)$ denotes the set of distributions supported on the $d$-dimensional unit ball $B(0,1)$ that have a density with finite $L^2$ norm and whose $(\beta,2)$-Sobolev norm, defined in (11), is bounded by $C$; second, $\mathcal{P}_G(d) = \{\mu * N(0, I_d) : \mathrm{supp}(\mu) \subseteq B(0,1)\}$ is the class of Gaussian mixtures with compactly supported mixing distribution. We remind the reader that the total variation distance has the variational form $\mathrm{TV}(p,q) = \sup_{h:\mathbb{R}^d\to[0,1]} \mathbb{E}_p h - \mathbb{E}_q h$.
Our first result concerns the following class of discriminators:
$$\mathcal{D}_1 = \big\{x \mapsto \mathbb{1}\{x^\top v \ge b\} : v \in \mathbb{R}^d,\ b \in \mathbb{R}\big\}, \quad (3)$$
the class of affine classifiers, which can be seen as a single-layer perceptron with a threshold nonlinearity.

Theorem 1. For any $\beta > 0$, $d \ge 1$ and $C > 0$, there exists a finite constant $C_1 = C_1(\beta, d, C)$ such that
$$\sup_{\nu \in \mathcal{P}_S(\beta,d,C)} \mathbb{E}\,\mathrm{TV}(\hat\nu, \nu) \le C_1\, n^{-\frac{\beta}{2\beta+d+1}}, \quad (4)$$
where the estimator $\hat\nu$ is defined in (2) with $\mathcal{D} = \mathcal{D}_1$ and $\mathcal{G} = \mathcal{P}_S(\beta,d,C)$. Similarly, for any $d \ge 1$ there exists a finite constant $C_2 = C_2(d)$ so that
$$\sup_{\nu \in \mathcal{P}_G(d)} \mathbb{E}\,\mathrm{TV}(\hat\nu, \nu) \le C_2\, \frac{(\log n)^{\frac{2d+2}{4}}}{\sqrt{n}},$$
where the estimator $\hat\nu$ is defined in (2) with $\mathcal{D} = \mathcal{D}_1$ and $\mathcal{G} = \mathcal{P}_G(d)$.

Recall the classical result (Ibragimov and Khasminskii, 1983) showing that the minimax optimal estimation rate in TV over the class $\mathcal{P}_S(\beta,d,C)$ equals $n^{-\beta/(2\beta+d)}$ up to constant factors. Thus, the estimator in (4) is almost optimal, the only difference being that the dimension $d$ is replaced by $d+1$. Similarly, for Gaussian mixtures we reach the parametric rate up to a polylog factor.²

2. For estimation of Gaussian mixtures in total variation the precise value of the minimax optimal polylog factor is at present unknown. However, for the $L^2$ distance the minimax rate is known, and in the course of our proofs (see (24)) we show that our estimator only loses a multiplicative factor of $\log(n)^{1/4}$ compared to the optimal $L^2$-rate $\log(n)^{d/4}/\sqrt{n}$ derived in (Kim and Guntuboyina, 2022).

The proof of Theorem 1 relies on a comparison inequality between total variation and the "perceptron discrepancy", or maximum halfspace distance, which we define as
$$d_H(\mu,\nu) \triangleq \sup_{f \in \mathcal{D}_1} \{\mathbb{E}_\mu f - \mathbb{E}_\nu f\}. \quad (5)$$
Note first that $d_H \le \mathrm{TV}$ clearly holds, since all functions in the class $\mathcal{D}_1$ are bounded by 1. For the other direction, by proving a generalization of the Gagliardo-Nirenberg-Sobolev inequality we derive the following comparisons.

Theorem 2.
For any $\beta > 0$, $d \ge 1$, there exists a finite constant $C_1 = C_1(\beta, d)$ such that
$$\mathrm{TV}(\mu,\nu)^{\frac{2\beta+d+1}{2\beta}} \le C_1 \big( \|\mu\|_{\beta,2}^2 + \|\nu\|_{\beta,2}^2 \big)^{\frac{d+1}{4\beta}}\, d_H(\mu,\nu) \quad (6)$$
holds for all $\mu,\nu \in \mathcal{P}_S(\beta,d,\infty)$. Similarly, for any $d \ge 1$ there exists a finite constant $C_2 = C_2(d)$ such that
$$\frac{\mathrm{TV}(\mu,\nu)}{\log\big(3 + \frac{1}{\mathrm{TV}(\mu,\nu)}\big)^{\frac{2d+2}{4}}} \le C_2\, d_H(\mu,\nu)$$
holds for all $\mu,\nu \in \mathcal{P}_G(d)$.

We remark that we also show (in Proposition 12) that the exponent $\frac{2\beta+d+1}{2\beta}$ in (6) is tight, i.e., it cannot be improved in general. With Theorem 2 in hand, the proof of Theorem 1 is notably simple. For example, let us prove (4) (for full details, see Section 4.2). Recall that $X_i \overset{\text{i.i.d.}}{\sim} \nu$, that $\nu_n$ is the empirical distribution, and that $\hat\nu = \arg\min_{\nu' \in \mathcal{P}_S} d_H(\nu', \nu_n)$. We then have, from the triangle inequality and the minimality of $\hat\nu$:
$$d_H(\hat\nu, \nu) \le d_H(\hat\nu, \nu_n) + d_H(\nu_n, \nu) \le 2\, d_H(\nu_n, \nu).$$
Thus, from Theorem 2 we have
$$\mathrm{TV}(\hat\nu, \nu) \lesssim d_H(\nu_n, \nu)^{\frac{2\beta}{2\beta+d+1}}. \quad (7)$$
Lastly, we recall that $\mathcal{D}_1$ is a class with finite VC-dimension, and thus from uniform convergence (Theorem 8.3.23, Vershynin (2018)) we have, for some dimension-dependent constant $C(d)$, that $\mathbb{E}[d_H(\nu_n,\nu)] \le \frac{C(d)}{\sqrt{n}}$. Thus, applying expectation and Jensen's inequality to (7) we get
$$\mathbb{E}[\mathrm{TV}(\hat\nu,\nu)] \lesssim \mathbb{E}\big[d_H(\nu_n,\nu)^{\frac{2\beta}{2\beta+d+1}}\big] \le \mathbb{E}[d_H(\nu_n,\nu)]^{\frac{2\beta}{2\beta+d+1}} \lesssim n^{-\frac{\beta}{2\beta+d+1}}$$
as claimed. While we believe that Theorem 1 provides theoretical support for the efficacy of simple discriminators, it has several theoretical and practical deficiencies that we need to address. First, the guarantee in Theorem 1 for $\mathcal{P}_S$ is strictly worse than the minimax optimal rate, which is $O\big(n^{-\frac{\beta}{2\beta+d}}\big)$ (see e.g. Ibragimov and Khasminskii (1983)). Second, from the implementation point of view, computing the distance $d_H$ behind Theorem 1 is impractical.
Indeed, finding the halfspace with maximal separation between even two empirical measures is a nonconvex, non-differentiable problem: it takes super-polynomial time in the dimension $d$ assuming $\mathrm{P} \ne \mathrm{NP}$ (Guruswami and Raghavendra, 2009), and $d^{\omega(\varepsilon^{-1})}$ time for $\varepsilon$-optimal agnostic learning between two densities assuming hardness of either SIVP or gapSVP (Tiegel, 2023). Finally, even if we disregard the computational complexity, it is unclear how to solve $\arg\min_{\nu'} d_H(\nu', \nu_n)$ over the class for given samples. This concern is alleviated by the fact that any $\hat\nu$ satisfying $d_H(\hat\nu, \nu_n) = O(\sqrt{d/n})$ works without degrading our performance guarantee, and thus only an approximate minimizer is needed. To address the above concerns, we make two changes to improve $d_H$ in the min-distance density estimator: (a) we replace the perceptron class $\mathcal{D}_1$ in (3) with a generalized perceptron class $\mathcal{D}_\gamma$ for $\gamma \in (0,2)\setminus\{1\}$, defined as
$$\mathcal{D}_\gamma = \big\{x \mapsto |x^\top v - b|^{\frac{\gamma-1}{2}} : v \in \mathbb{R}^d,\ b \in \mathbb{R}\big\}, \quad \gamma \in (0,2)\setminus\{1\}; \quad (8)$$
(b) we replace the perceptron discrepancy $d_H$ (defined with respect to the best perceptron) with an "average" version $\bar d_H$ defined in (13) (see (18) for the definition with general $\gamma$). Therefore, one does not even need to find an approximately optimal halfspace, as random halfspaces provide sufficient discriminatory power. These changes are made precise in Section 2. Our Theorem 18 and Corollary 19 show that these two changes allow us to achieve a total variation rate of $n^{-\beta/(2\beta+d+\gamma)}$ for the min-distance density estimator. The improved rate comes to within $\mathrm{polylog}(n)$ of minimax optimality as $\gamma \to 0$ adaptively with $n$, addressing our first concern. Somewhat unexpectedly, we discover that the average perceptron discrepancy $\bar d_H$ exactly equals (up to an explicit dimensional constant) Székely and Rizzo's energy distance $\mathcal{E}_1$ (Definition 1, Székely and Rizzo (2013)), defined as
$$\mathcal{E}_1^2(\mu,\nu) \triangleq \mathbb{E}\big[2\|X-Y\| - \|X-X'\| - \|Y-Y'\|\big], \quad (X, X', Y, Y') \sim \mu^{\otimes2} \otimes \nu^{\otimes2}, \quad (9)$$
where $\|\cdot\|$ is the usual Euclidean norm on $\mathbb{R}^d$.
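For empirical measures, the expectation in (9) reduces to pairwise distances, so the squared sample energy distance costs only quadratically many operations. The following minimal NumPy sketch (the function name and the `gamma` parameter are ours, for illustration; `gamma=1` recovers (9)) computes it directly:

```python
import numpy as np

def energy_distance_sq(x, y, gamma=1.0):
    """Squared (generalized) energy distance between two empirical samples
    x of shape (n, d) and y of shape (m, d), via the pairwise formula (9):
    2 E||X-Y||^gamma - E||X-X'||^gamma - E||Y-Y'||^gamma."""
    def mean_pow_dist(a, b):
        # mean of ||a_i - b_j||^gamma over all pairs (i, j)
        diff = a[:, None, :] - b[None, :, :]
        return np.mean(np.linalg.norm(diff, axis=-1) ** gamma)
    return 2 * mean_pow_dist(x, y) - mean_pow_dist(x, x) - mean_pow_dist(y, y)
```

Because this is a smooth function of the sample locations (away from coincident points), automatic differentiation frameworks can backpropagate through it, which is what makes it convenient as a GAN-type training loss.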
Thus, our Theorem 18 (with $\gamma = 1$) shows that minimizing $\min_{\nu'} \mathcal{E}_1(\nu', \nu_n)$ gives a density estimator with the rates over $\mathcal{P}_S$ and $\mathcal{P}_G$ given in Theorem 1. Furthermore, for $\gamma \ne 1$ the corresponding average over $\mathcal{D}_\gamma$ results in the distance $\mathcal{E}_\gamma$, known as the generalized energy distance, defined in the same paper. See Section 2 for details. This discovery addresses our second and final concern in the following sense. Treating the distance $\mathcal{E}_\gamma(\tilde\nu^{\mathrm{gen.}}_m, \nu^{\mathrm{target}}_n)$ as a loss function for fitting the generator $\tilde\nu^{\mathrm{gen.}}_m$, the closed-form computation of $\mathcal{E}_\gamma$ via (9) requires only $O(n^2 + m^2)$ steps and is friendly to gradient evaluations. Overall, our message from an algorithmic point of view is as follows: assuming one has access to a parametric family of generators sampling from $\nu_\theta$ for parameters $\theta \in \mathbb{R}^p$, and if one can compute gradients $\nabla_\theta$ of the generator forward pass, e.g., via the pushforward of a reference distribution under a smooth transport map or neural-network-based models (Wang and Marzouk (2022); Marzouk et al. (2023)), then one can fit $\theta$ to the empirical sample $\nu_n$ by running stochastic gradient descent steps: (i) sample $m$ points from $\nu_\theta$ and form the empirical distribution $\tilde\nu_m$; (ii) compute the loss $\mathcal{E}_\gamma(\tilde\nu_m, \nu_n)$ and backpropagate the gradient with respect to $\theta$; (iii) update $\theta \leftarrow \theta - \eta\, \nabla_\theta \mathcal{E}_\gamma(\tilde\nu_m, \nu_n)$ for some step size $\eta$. Again, computing $\mathcal{E}_\gamma(\tilde\nu_m, \nu_n)$ via (9) requires $O(n^2 + m^2)$ steps and is gradient-friendly. Note, though, that obtaining and certifying good parameterizations for generators with densities satisfying Sobolev norm constraints is nontrivial and may require careful regularization. In this work, however, our focus is on the discriminator part of GANs, as opposed to the generator.

1.1 Contributions

This work focuses on studying discriminator classes for adversarially estimating smooth densities; our main contributions are as follows.
- We show that $\beta$-smooth distributions, Gaussian mixtures and discrete distributions that are far apart in total variation distance must possess a halfspace on which their masses differ substantially (Theorems 2, 11 and 14).
- We apply these separation results to density estimation, showing that an ERM density estimator nearly attains the minimax optimal density estimation rate with respect to TV over the aforementioned distribution classes (Theorems 1, 18 and 21). This suggests that halfspace indicators are (almost) sufficient discriminator classes in the GAN objective (2) for achieving (close to) minimax optimal density estimation over the considered classes of distributions.
- In Section 2 we show that the average halfspace separation distance $\bar d_H$ is equal, up to a constant, to the energy distance $\mathcal{E}_1$ (Theorem 4), which has many equivalent expressions: as a weighted $L^2$-distance between characteristic functions (Proposition 5), as the sliced Cramér-2 distance (Theorem 8), as an IPM/MMD/energy distance (Section 2.3), and as the $L^2$-norm of the Riesz potential (Proposition 9).
- We generalize the average halfspace distance $\bar d_H$ to include an exponent $\gamma \in (0,2)$, corresponding to the generalized energy distance $\mathcal{E}_\gamma$. Consequently, we discover that if instead of the thresholded linear features $\mathbb{1}\{v^\top x > b\}$ we use the nonlinearity $|v^\top x - b|^{\frac{\gamma-1}{2}}$, smooth distributions and Gaussian mixtures can be separated even better (Theorem 11). Combined with the fact that $\mathcal{E}_\gamma$, similarly to $\bar d_H$, decays between population and sample measures at the parametric rate (Lemma 7), the ERM for $\mathcal{E}_\gamma$ reduces the slack in the density estimation rate, achieving minimax log-optimality (Corollary 19). This result, combined with its strong approximation properties, supports its use in modern generative models (e.g. Ho et al. (2020); Goodfellow et al. (2014); Rombach et al. (2022); Ramesh et al.).
- Finally, Proposition 24 shows that recent work applying $\bar d_H$ to two-sample testing is suboptimal over the class of smooth distributions in the minimax sense.

1.2 Related Work

An important concept related to our work is that of Integral Probability Metrics (IPMs),
$$d^{\mathrm{IPM}}_{\mathcal{D}}(P, Q) \triangleq \sup_{f \in \mathcal{D}} |\mathbb{E}_P f - \mathbb{E}_Q f|, \quad (10)$$
where $\mathcal{D}$ is the discriminator class. Examples include TV, when $\mathcal{D}$ is the class of bounded functions, and $W_1$, when $\mathcal{D}$ is the class of Lipschitz functions. IPM distances represent the maximal separation between two distributions achievable by an adversary powered with $\mathcal{D}$, and can be viewed as the generator objective in the GAN (2). In terms of distance comparison inequalities on smooth density classes similar to our Theorem 2, the closest work we found is Chae and Walker (2020), in which the authors show the comparison $\mathrm{TV} \lesssim W_1^{\beta/(\beta+1)}$ for $(L^1,\beta)$-Sobolev smooth densities on $\mathbb{R}$. This inequality is also optimal in the exponent. Subsequently, Chae (2024) extended the comparison inequality to Besov densities on $\mathbb{R}^d$ and to $W_p$ metrics. These results³ serve the same purpose as ours: they show that achieving optimal estimation in a weak norm (such as $W_1$) implies an optimal estimation rate under a strong norm.

3. We thank anonymous reviewers for pointing them out to us.

Since $W_1$ corresponds to a non-parametric discriminator class of all Lipschitz functions, this result does not explain why simple discriminators work so well in practice. Another paper related to distance comparison is Bai et al. (2018), whose authors study the comparison between the Wasserstein distance $W_1$ and the IPM $d_{\mathrm{relu}}$ defined by the discriminator class $\mathcal{D} = \{x \mapsto \mathrm{Relu}(x^\top v + b) : |b| \le 1, \|v\| \le 1\}$. They show (Bai et al., 2018, Theorem 3.1) that $\sqrt{\kappa/d}\; W_1 \lesssim d_{\mathrm{relu}} \le W_1$ for Gaussian distributions with mean in the unit ball, where $\kappa$ is an upper bound on their condition numbers and $d$ is the dimension.
They obtain results for other distribution classes (Gaussian mixtures, exponential families), but for each of these they use a different class of discriminators, adapted to the problem. In the density estimation literature, an estimator of the form (2) applied to smooth densities first appeared in the famous work of Yatracos (1985). Instead of indicators of halfspaces, they consider the class of discriminators $\mathcal{Y}_\epsilon \triangleq \{\mathbb{1}\{d\nu_i/d\nu_j \ge 1\} : 1 \le i, j \le N(\epsilon, \mathcal{G})\}$, where $\nu_1, \dots, \nu_{N(\epsilon,\mathcal{G})}$ forms a minimal $\epsilon$-TV covering of the class $\mathcal{G}$ and $N(\epsilon,\mathcal{G})$ is the so-called covering number. Writing $d_{\mathcal{Y}}(\mu, \mu') = \sup_{f \in \mathcal{Y}} (\mathbb{E}_\mu f - \mathbb{E}_{\mu'} f)$, it is not hard to prove that $|\mathrm{TV} - d_{\mathcal{Y}}| = O(\epsilon)$ on $\mathcal{G} \times \mathcal{G}$ and that $\mathbb{E}\, d_{\mathcal{Y}}(\nu, \nu_n) \lesssim \sqrt{\log N(\epsilon, \mathcal{G})/n}$ by a union bound coupled with a binomial tail inequality. From here $\mathbb{E}\,\mathrm{TV}(\hat\nu, \nu) \lesssim \sqrt{\log N(\epsilon,\mathcal{G})/n} + \epsilon$ follows by the triangle inequality (here $\hat\nu$ is defined as in (2) with $\mathcal{D} = \mathcal{Y}_\epsilon$). Note that, in contrast to our perceptron discrepancy $d_H$, Yatracos' estimator attains the optimal rate on $\mathcal{G} = \mathcal{P}_S$, corresponding to the choice $\epsilon = \epsilon(n) \asymp n^{-\beta/(2\beta+d)}$. In terms of GAN (or, more generally, adversarial) density estimation, a series of papers (Singh et al. (2018), Uppal et al. (2019), Chen et al. (2020), Belomestny et al. (2021), Liang (2021), Chae (2022), Stéphanovitch et al. (2023), etc.) study the problem of density estimation over smoothness classes with respect to IPM distances. These works typically choose the discriminator class to be a large neural network, of size growing with $n$ and $\beta$, giving a universal approximation of some non-parametric discriminator class dependent on $\beta$. Given such a fine approximation of the discriminator class, one obtains IPM error bounds (on e.g. $W_1$), which are then converted to TV bounds via comparison inequalities. In this paper, we stress again, the discriminator class is extremely small and weak (a collection of halfspaces) and does not depend on the smoothness $\beta$ or the number of samples $n$. We provide a more detailed literature review and discussion in Section A.
Another paper, by Oko et al. (2023), derives minimax density estimation guarantees for a class of diffusion-based estimators. Their result is similar to ours in that it obtains rigorous (near-)optimal guarantees for a method that is a realistic model of what is currently done in practice. However, their analysis also relies crucially on the universal approximation property of neural networks for the score function. In the present work, we mainly focus on the fixed and interpretable discriminator class $\{x \mapsto f(x^\top v - b) : \|v\| \le 1, b \in \mathbb{R}\}$ for a set of prescribed $f$, and derive comparisons to TV for smooth distributions, Gaussian mixtures, and discrete distributions. In addition, we prove the (near-)optimality of our results (for smooth densities) and also derive nonparametric estimation rates for the corresponding GAN density estimators. Our work is, to the best of our knowledge, the first that satisfies the following:

1. It uses a vanilla GAN-type approach (2) with ERM (which is close to practice) and obtains total variation rates that are (close to) minimax optimal.

2. The discriminator class is oblivious to the smoothness parameters: parts of our results are also $n$-oblivious, and the ones that depend on $n$ only concern the activation function. Our proposed discriminator class is also very simple: we use a composition of a single nonlinearity and a linear function (i.e., it is a VC class with dimension $d+1$).

The closest works to the above objectives are Belomestny et al. (2021) and Stéphanovitch et al. (2023), model 3, both of which derived the log-optimal TV risk $\tilde O\big(n^{-\frac{\beta}{2\beta+d}}\big)$ via (2). However, they both rely on the universal approximation property of neural networks, which have to be larger than some function of $\beta$, the smoothness parameter, as well as $n$, the sample size.
Hence another message, in contrast to the above works on discriminator networks, is that with an extremely simple class of discriminators parametrized by halfspaces (even with finite VC dimension) and without the need for training/optimization, one can get almost optimal density estimation rates by averaging random halfspace distances (Theorem 18). Moreover, if any generator fits the empirical distribution to within $O(1/\sqrt{n})$ in our distance (a rate that is oblivious to $\beta$), we get the corresponding TV guarantee with high probability (Proposition 23). Independently of this work, recent results by Paik et al. (2023) investigate the halfspace separability of distributions in the setting of two-sample testing. However, their focus is on the asymptotic power of the test as the number of samples grows to infinity. Our lower bound construction presented in Section E proves that their proposed test is suboptimal in the minimax setting. See Section 5 for a more detailed discussion.

1.3 Notation

The symbols $O, o, \Theta, \Omega, \omega$ follow the conventional big-O notation, and $\tilde O, \tilde o$ hide polylogarithmic factors. We use $\lesssim$, $\gtrsim$ and $\asymp$ throughout our calculations to hide multiplicative constants that are irrelevant (depending on the context). Given a vector $x \in \mathbb{R}^d$, we write $\|x\|$ for its Euclidean norm and $\langle x, y \rangle \triangleq x^\top y$ for the Euclidean inner product of $x, y \in \mathbb{R}^d$. The Gamma function is denoted by $\Gamma$. We write $B(x,r) \triangleq \{y \in \mathbb{R}^d : \|x-y\| \le r\}$, $S^{d-1} \triangleq \{x \in \mathbb{R}^d : \|x\| = 1\}$, and $\sigma$ for the unnormalized surface measure on $S^{d-1}$. The surface area of the unit $(d-1)$-sphere is also written as $\sigma(S^{d-1}) = 2\pi^{d/2}/\Gamma(\frac{d}{2})$. In particular, if $X$ is a random vector uniformly distributed on $S^{d-1}$, then for any $h$ we have $\mathbb{E}[h(X)] = \frac{1}{\sigma(S^{d-1})} \int_{S^{d-1}} h(y)\, d\sigma(y)$. The convolution between functions/measures is denoted by $*$. We write $L^p(\mathbb{R}^d)$ for the space of (equivalence classes of) functions $\mathbb{R}^d \to \mathbb{C}$ that satisfy $\|f\|_p \triangleq \big(\int_{\mathbb{R}^d} |f(x)|^p\, dx\big)^{1/p} < \infty$. The space of all probability distributions on $\mathbb{R}^d$ is denoted by $\mathcal{P}(\mathbb{R}^d)$. For a signed measure $\nu$ we write $\mathrm{supp}(\nu)$ for its support and $M_r(\nu) \triangleq \int \|x\|^r\, d|\nu|(x)$ for its $r$-th absolute moment.
Given $P, Q \in \mathcal{P}(\mathbb{R}^d)$ we write $\mathrm{TV}(P,Q) \triangleq \sup_{A \subseteq \mathbb{R}^d} [P(A) - Q(A)]$ for the total variation distance, where the supremum is over all measurable sets. Given a function $f \in L^1(\mathbb{R}^d)$, define its Fourier transform as $\hat f(\omega) \triangleq \mathcal{F}[f](\omega) \triangleq \int_{\mathbb{R}^d} e^{-i\langle x, \omega\rangle} f(x)\, dx$. Given a finite signed measure $\nu$ on $\mathbb{R}^d$, define its Fourier transform as $\mathcal{F}[\nu](\omega) \triangleq \int_{\mathbb{R}^d} e^{-i\langle \omega, x\rangle}\, d\nu(x)$. We extend the Fourier transform to $L^2(\mathbb{R}^d)$ and tempered distributions in the standard manner. Given $f \in L^2(\mathbb{R}^d)$ and $\beta > 0$, define its homogeneous Sobolev seminorm of order $(\beta, 2)$ as
$$\|f\|_{\beta,2} \triangleq \Big( \int_{\mathbb{R}^d} \|\omega\|^{2\beta}\, |\hat f(\omega)|^2\, d\omega \Big)^{1/2}. \quad (11)$$
Further, we define two specific classes of distributions of interest as follows: $\mathcal{P}_S(\beta,d,C)$ is a set of smooth densities, while $\mathcal{P}_G(d)$ is the set of all Gaussian mixtures with mixing distribution supported in the unit ball; formally,
$$\mathcal{P}_S(\beta,d,C) \triangleq \{\mu \in \mathcal{P}(\mathbb{R}^d) : \mathrm{supp}(\mu) \subseteq B(0,1),\ \mu \text{ has density } p \text{ with } \|p\|_{\beta,2} < C\},$$
$$\mathcal{P}_G(d) \triangleq \{\nu * N(0, I_d) : \nu \in \mathcal{P}(\mathbb{R}^d),\ \mathrm{supp}(\nu) \subseteq B(0,1)\}.$$

Assumption 1. Throughout the paper we assume that $C$ in the definition of $\mathcal{P}_S(\beta,d,C)$ is large enough relative to $\beta$ and $d$, such that $\mathcal{P}_S(\beta,d,C/2)$ is non-empty.

1.4 Structure

The structure of the paper is as follows. In Section 2 we introduce the generalized energy distance, the main object of our study. We show how it relates to the perceptron discrepancy $d_H$ and its relaxation $\bar d_H$, and we record equivalent formulations of the generalized energy distance, one of which is a novel sliced-distance form. In Section 3 we present our main technical results on comparison inequalities between total variation and the energy distance. In Section 4 we analyse the density estimator that minimizes the empirical energy distance, and prove Theorem 1 and Theorem 2 in Section 4.2. In Section 5 we show that the use of $\bar d_H$ for two-sample testing results in suboptimal performance. We conclude in Section 6. All omitted proofs and auxiliary results are deferred to the Appendix.

2.
The Generalized Energy Distance

Given two probability distributions $\mu, \nu$ on $\mathbb{R}^d$ with finite $\gamma$-th moment, the generalized energy distance of order $\gamma \in (0,2)$ between them is defined as
$$\mathcal{E}_\gamma^2(\mu,\nu) = \mathbb{E}\big[2\|X-Y\|^\gamma - \|X-X'\|^\gamma - \|Y-Y'\|^\gamma\big], \quad (X,X',Y,Y') \sim \mu^{\otimes2} \otimes \nu^{\otimes2}. \quad (12)$$
As we alluded to in the introduction, the proof of Theorems 1 and 2 becomes possible once we relax the supremum in the definition of $d_H$ to an unnormalized average over halfspaces. In Section 2.1 we discuss this relaxation in more detail and identify a connection to the energy distance $\mathcal{E}_1$ defined above in (12). Motivated by this, we study the (generalized) energy distance and give multiple equivalent characterizations of it in Sections 2.1 through 2.5.

2.1 From Perceptron Discrepancy to Energy Distance

Our first goal is to connect the study of $d_H$ to the study of $\mathcal{E}_\gamma$ with $\gamma = 1$. To achieve this, we introduce an intermediary, the "average" perceptron discrepancy $\bar d_H$. Given two probability distributions $\mu, \nu$ on $\mathbb{R}^d$, we define
$$\bar d_H^2(\mu,\nu) \triangleq \int_{S^{d-1}} \int_{\mathbb{R}} \Big( \int \mathbb{1}\{\langle v, x\rangle \ge b\}\, \big(d\mu(x) - d\nu(x)\big) \Big)^2 db\, d\sigma(v), \quad (13)$$
where $\sigma$ denotes the surface area measure. If the two distributions $\mu, \nu$ are supported on a compact set, then this definition can indeed be regarded as a mean-squared version of the perceptron discrepancy, because the integrals over $b$ and $v$ effectively range over bounded sets. In general, however, the integral over $b$ in the definition of $\bar d_H$ is not normalizable, and that is why we put "average" in quotes. Nevertheless, we have the following comparisons between $d_H$ and $\bar d_H$.

Proposition 3. For any $\beta > 0$, $d \ge 1$, $C > 0$, and for all $\mu,\nu \in \mathcal{P}_S(\beta,d,\infty)$, we have
$$\sqrt{\frac{\Gamma(d/2)}{4\pi^{d/2}}}\; \bar d_H(\mu,\nu) \le d_H(\mu,\nu). \quad (14)$$
Moreover, for all $d \ge 1$, there exists a finite constant $C_1 = C_1(d)$ such that for all $\mu,\nu \in \mathcal{P}_G(d)$,
$$\frac{\bar d_H(\mu,\nu)}{\log\big(3 + 1/\bar d_H(\mu,\nu)\big)^{1/4}} \le C_1\, d_H(\mu,\nu).$$

Proof. The proof of (14) is immediate after noting that all distributions in $\mathcal{P}_S(\beta,d,\infty)$ are supported on the $d$-dimensional unit ball and that $\int_{S^{d-1}} \int_{-1}^1 db\, d\sigma(v) = 4\pi^{d/2}/\Gamma(d/2)$.
Thus, we focus on the Gaussian mixture case. Write $\mu - \nu = \tau * \varphi$, where $\varphi$ denotes the density of the standard Gaussian $N(0, I_d)$ and $\tau$ is the (signed) difference of the two implicit mixing measures. For any $R > 0$, we have
$$d_H(\mu,\nu) \ge \sup_{v \in S^{d-1},\, |b| \le R} \Big| \int \mathbb{1}\{\langle x,v\rangle \ge b\}\, (\tau * \varphi)(x)\, dx \Big| \ge \sqrt{ \frac{1}{2R\, \mathrm{vol}_{d-1}(S^{d-1})} \int_{S^{d-1}} \int_{-R}^{R} \Big( \int \mathbb{1}\{\langle x,v\rangle \ge b\}\, (\tau * \varphi)(x)\, dx \Big)^2 db\, d\sigma(v) }.$$
Now, since $\tau$ is supported on a subset of $B(0,1)$ by the definition of the class $\mathcal{P}_G(d)$, for any $v \in S^{d-1}$ and $|b| \ge R \ge 2$ we have the bound
$$\Big| \int_{\langle x,v\rangle \ge b} \int_{\mathbb{R}^d} \varphi(x-y)\, d\tau(y)\, dx \Big| \lesssim \int_{\langle x,v\rangle \ge |b|} \exp\big(-(\|x\|-1)^2/2\big)\, dx \lesssim \int_{\|x\| \ge |b|} \exp\big(-\|x\|^2/8\big)\, dx \le \exp(-\Omega(R^2)),$$
where we implicitly used that $\int d\tau = 0$, as $\tau$ is the difference of two probability distributions. Hence the contribution to $\bar d_H^2$ from the region $|b| > R$ is at most $\exp(-\Omega(R^2))$, and choosing $R \asymp \sqrt{\log(3 + 1/\bar d_H(\mu,\nu))}$ concludes the proof.

Proposition 3 implies that to obtain a comparison between TV and $d_H$, specifically for lower bounding $d_H$, it suffices to consider the relaxation $\bar d_H$ instead. The next observation we make is that $\bar d_H$ is in fact equal, up to a constant, to the energy distance.

Theorem 4. Let $\mu, \nu$ be probability distributions on $\mathbb{R}^d$ with finite mean. Then
$$\bar d_H(\mu,\nu) = \frac{\pi^{(d-1)/4}}{\sqrt{\Gamma\big(\frac{d+1}{2}\big)}}\; \mathcal{E}_1(\mu,\nu).$$

Proof. This is a direct implication of (15) and (18). We defer the proof to Theorem 8, which is its direct generalization: the interpretation of $\mathcal{E}_\gamma$ as the average $\mathcal{D}_\gamma$ distance for all $\gamma \in (0,2)$.

As will be clear from the rest of the paper, it does pay off to study $\mathcal{E}_\gamma$ for general $\gamma$, even though so far we have only justified its relevance to the results stated in the introduction for the case $\gamma = 1$. With this in mind, we proceed to study various properties of the generalized energy distances $\{\mathcal{E}_\gamma\}_{\gamma \in (0,2)}$.

2.2 The Fourier Form

The formulation of the generalized energy distance that we rely on most heavily in our proofs is the following.

Proposition 5 ((Székely and Rizzo, 2013, Proposition 2)). Let $\gamma \in (0,2)$ and let $\mu, \nu$ be probability distributions on $\mathbb{R}^d$ with finite $\gamma$-th moment. Then
$$\mathcal{E}_\gamma^2(\mu,\nu) = F_\gamma(d) \int_{\mathbb{R}^d} \frac{|\hat\mu(\omega) - \hat\nu(\omega)|^2}{\|\omega\|^{d+\gamma}}\, d\omega, \quad (15)$$
where we define $F_\gamma(d) = \frac{\gamma\, 2^{\gamma-1}\, \Gamma(\frac{d+\gamma}{2})}{\pi^{d/2}\, \Gamma(1 - \frac{\gamma}{2})}$.

Remark 6.
Note that $F_\gamma(d) = \Theta\big(\gamma(2-\gamma)\, \Gamma(\frac{d+\gamma}{2})\, \pi^{-d/2}\big)$ up to a universal constant. This shows that the generalized energy distance is a weighted $L^2$ distance in Fourier space. The fact that $\mathcal{E}_\gamma$ is a valid metric on probability distributions with finite $\gamma$-th moment is a simple consequence of Proposition 5.

2.3 The MMD and IPM Forms

Another interpretation of the generalized energy distance is through the theory of Maximum Mean Discrepancy (MMD). Given a set $\mathcal{X}$ and a positive semidefinite kernel $k : \mathcal{X}^2 \to \mathbb{R}$, there is a unique reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ consisting of the closure of the linear span of $\{k(x, \cdot) : x \in \mathcal{X}\}$ with respect to the inner product $\langle k(x,\cdot), k(y,\cdot)\rangle_{\mathcal{H}_k} = k(x,y)$. For a probability distribution $\mu$ on $\mathcal{X}$, define its kernel embedding as $\theta_\mu = \int_{\mathcal{X}} k(x, \cdot)\, d\mu(x)$. As shown in (Muandet et al., 2017, Lemma 3.1), the kernel embedding $\theta_\mu$ exists and belongs to the RKHS $\mathcal{H}_k$ if $\mathbb{E}[\sqrt{k(X,X')}] < \infty$ for $(X,X') \sim \mu^{\otimes2}$, as is the case for our kernel defined later in Equation (16). Then, given two probability distributions $\mu$ and $\nu$, the MMD measures their distance in the RKHS: $\mathrm{MMD}_k(\mu,\nu) \triangleq \|\theta_\mu - \theta_\nu\|_{\mathcal{H}_k}$. We refer the reader to Schölkopf and Smola (2001); Muandet et al. (2017) for more details on the underlying theory. MMD has a closed form thanks to the reproducing property:
$$\mathrm{MMD}_k^2(\mu,\nu) = \mathbb{E}\big[k(X,X') + k(Y,Y') - 2k(X,Y)\big], \quad (X,X',Y,Y') \sim \mu^{\otimes2} \otimes \nu^{\otimes2}.$$
Moreover, it also follows that MMD is an Integral Probability Metric (IPM) where the supremum is over the unit ball of the RKHS $\mathcal{H}_k$:
$$\mathrm{MMD}_k(\mu,\nu) = \sup_{f \in \mathcal{H}_k :\, \|f\|_{\mathcal{H}_k} \le 1} \mathbb{E}[f(X) - f(Y)].$$
In our case, we can define the kernel
$$k_\gamma(x,y) = \|x\|^\gamma + \|y\|^\gamma - \|x-y\|^\gamma \quad (16)$$
for $\gamma \in (0,2)$, which in one dimension corresponds to the covariance operator of fractional Brownian motion. For a proof of the nontrivial fact that $k_\gamma$ above is positive definite, see for example Sejdinovic et al. (2013). With the choice of $k_\gamma$, it follows trivially from its definition that the generalized energy distance $\mathcal{E}_\gamma$ is equal to the MMD with kernel $k_\gamma$, i.e.
$E_\gamma(\mu, \nu) = \mathrm{MMD}_{k_\gamma}(\mu, \nu)$ for all distributions $\mu, \nu$ with finite $\gamma$-th moment. It is noteworthy that while $d_H$ is by definition an IPM, so is its averaged version $\bar{d}_H$. A straightforward consequence of the above characterization is the fact that $E_\gamma$ decays at the parametric rate between empirical and population measures. This is not terribly surprising, as analogous results hold for arbitrary MMDs with bounded kernel, see for example (Gretton et al., 2012, Theorem 7). Recall that $M_t(\nu)$ denotes the $t$-th absolute moment of the measure $\nu$.

Lemma 7. Let $\nu$ be a probability distribution on $\mathbb{R}^d$ and let $\nu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ for an i.i.d. sample $X_1, \ldots, X_n$ from $\nu$. Then, for any $\gamma \in (0, 2)$,
$$\mathbb{E}\, E_\gamma^2(\nu, \nu_n) \leq \frac{2^{\max\{1, \gamma\}} M_\gamma(\nu)}{n}.$$
For a high-probability bound when $\nu$ is compactly supported, see Lemma 22.

Proof Let $\tilde{X}_1, \ldots, \tilde{X}_n$ be an additional i.i.d. sample from $\nu$, and write $\tilde{\nu}_n$ for the corresponding empirical measure. Using the definition of $E_\gamma$ in (12), we can compute
$$\mathbb{E}\, E_\gamma^2(\tilde{\nu}_n, \nu_n) = \mathbb{E}\Big[ \frac{2}{n^2} \sum_{i,j} \|X_i - \tilde{X}_j\|^\gamma - \frac{1}{n^2} \sum_{i,j} \|X_i - X_j\|^\gamma - \frac{1}{n^2} \sum_{i,j} \|\tilde{X}_i - \tilde{X}_j\|^\gamma \Big] = \frac{2}{n}\, \mathbb{E}\|X_1 - X_2\|^\gamma.$$
The conclusion then follows from taking the expectation of the expression
$$\mathbb{E}\big[ E_\gamma^2(\tilde{\nu}_n, \nu_n) \mid \nu_n \big] = E_\gamma^2(\nu, \nu_n) + \frac{1}{n}\, \mathbb{E}\|X_1 - X_2\|^\gamma$$
and the inequality $|x + y|^\gamma \leq 2^{\max\{0, \gamma - 1\}}(|x|^\gamma + |y|^\gamma)$ for all $x, y \in \mathbb{R}$.

As a clarifying remark: while IPM distances also appear in the formulation of GAN (2) with respect to the discriminator class $\mathcal{D}$, our distance $E_\gamma$ is not the IPM of $\mathcal{D}_\gamma$, but rather that of the RKHS ball of $k_\gamma$ defined above. For $\gamma = 1$, the IPM for $\mathcal{D}_\gamma$ returns $d_H$, which is lower bounded (up to constant, see Proposition 3) by $\bar{d}_H = \Theta(E_1)$. For $\gamma \neq 1$, we leave the study of the IPM for $\mathcal{D}_\gamma$, as well as the characterization of the RKHS ball of $k_\gamma$, out of scope. Despite $E_\gamma$ not being precisely the IPM of $\mathcal{D}_\gamma$, our next characterization relates $E_\gamma$ to $\mathcal{D}_\gamma$ in the same way $\bar{d}_H$ relates to $\mathcal{D}_1$, via a slicing perspective.

2.4 The Sliced Form

Another equivalent characterization of the generalized energy distance is in the form of a sliced distance.
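As a concrete illustration of the pairwise-distance (MMD) form and of the $1/n$ decay in Lemma 7, here is a minimal numerical sketch. The function names and the plug-in V-statistic are our own choices, not from the paper.

```python
import numpy as np

def energy_distance_sq(x, y, gamma=1.0):
    """Plug-in (V-statistic) estimate of E_gamma^2 between two samples:
    E_gamma^2 = 2 E||X-Y||^g - E||X-X'||^g - E||Y-Y'||^g,
    i.e. the squared MMD with kernel k_g(x,y) = ||x||^g + ||y||^g - ||x-y||^g."""
    def mean_dist(a, b):
        # mean of ||a_i - b_j||^gamma over all pairs (includes the zero diagonal)
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return (d ** gamma).mean()
    return 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def mean_e2_same_law(n, reps=30, d=1, seed=0):
    """Average E_1^2 between two independent empirical measures of N(0, I_d);
    by (the analogue of) Lemma 7 this should scale like 1/n."""
    rng = np.random.default_rng(seed)
    vals = [energy_distance_sq(rng.standard_normal((n, d)),
                               rng.standard_normal((n, d))) for _ in range(reps)]
    return float(np.mean(vals))
```

A handy sanity check: for two point masses at Euclidean distance $r$ the formula returns exactly $2r^\gamma$, since the two within-sample terms vanish.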
Sliced distances are calculated by first choosing a random direction on the unit sphere, and then computing a one-dimensional distance between the projections of the two input distributions onto the chosen direction. For $\gamma \in (0, 2)$ define the function
$$\psi_\gamma(x) = \begin{cases} |x|^{(\gamma-1)/2} & \text{for } \gamma \neq 1, \\ 1\{x \geq 0\} & \text{otherwise.} \end{cases} \qquad (17)$$
The following result, to the best of our knowledge, has not appeared in prior literature except for the case of $\gamma = 1$. This is also the direct generalization of Theorem 4.

Theorem 8. Let $\gamma \in (0, 2)$ and let $\mu, \nu$ be probability distributions on $\mathbb{R}^d$ with finite $\gamma$-th moment. Then for $(X, Y) \sim \mu \otimes \nu$ we have
$$E_\gamma^2(\mu, \nu) = \frac{1}{S_\gamma(d)} \int_{S^{d-1}} \int_{\mathbb{R}} \Big[ \mathbb{E}\,\psi_\gamma(\langle X, v \rangle - b) - \mathbb{E}\,\psi_\gamma(\langle Y, v \rangle - b) \Big]^2\, db\, d\sigma(v), \qquad (18)$$
where, for $\gamma \neq 1$,
$$S_\gamma(d) = \frac{4\sin^2\big(\tfrac{\pi(\gamma-1)}{4}\big)\, \Gamma\big(\tfrac{\gamma+1}{2}\big)^2\, \pi^{d/2-1}\, \Gamma\big(1 - \tfrac{\gamma}{2}\big)}{\gamma 2^{\gamma-1}\, \Gamma\big(\tfrac{d+\gamma}{2}\big)},$$
and $S_1(d) = \pi^{(d-1)/2}\, \Gamma\big(\tfrac{d+1}{2}\big)^{-1}$.

The proof of Theorem 8 hinges on computing the Fourier transform of the function $\psi_\gamma$, which can be interpreted as a tempered distribution. We point out a special property of the integral on the right-hand side of (18). After expanding the square, one finds that the individual terms in the sum are not absolutely integrable for $\gamma = 1$. However, due to cancellations within the squared quantity, the integral is finite. As claimed, using the language of Nadjahi et al. (2020), Theorem 8 allows us to interpret $E_\gamma$ as a sliced probability divergence. Given $v \in S^{d-1}$, write $\theta_v = \langle v, \cdot \rangle$, and $\theta_v \# \nu = \nu \circ \theta_v^{-1}$ for the pushforward of $\nu$ under $\theta_v$. We have
$$S_\gamma(d)\, E_\gamma^2(\mu, \nu) = S_\gamma(1) \int_{S^{d-1}} E_\gamma^2(\theta_v \# \mu,\, \theta_v \# \nu)\, d\sigma(v).$$
We may also observe that the energy distance $E_1$ is equal to the sliced Cramér-2 distance up to constant, which has been studied recently by both theoretical and empirical works (Knop et al., 2018; Kolouri et al., 2022).4

4. The Cramér-$p$ distance is simply the $L^p$ distance between cumulative distribution functions.

2.5 The Riesz Potential Form

The generalized energy distance can also be linked to the Riesz potential (Landkof, 1972, Chapter 1.1), which is the inverse of the fractional Laplace operator.
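Before moving on, the sliced form of Section 2.4 can be sketched numerically: average squared one-dimensional energy distances of random projections. This is our own illustrative code; the constants $S_\gamma(d)$, $S_\gamma(1)$ from Theorem 8 are deliberately omitted, so the result is only proportional to $E_\gamma^2$.

```python
import numpy as np

def e2_1d(x, y, gamma=1.0):
    # V-statistic for the squared one-dimensional energy distance
    m = lambda a, b: (np.abs(a[:, None] - b[None, :]) ** gamma).mean()
    return 2 * m(x, y) - m(x, x) - m(y, y)

def sliced_e2(x, y, gamma=1.0, n_dirs=200, seed=0):
    """Monte Carlo version of the sliced form: draw directions v uniformly on
    the sphere and average E_gamma^2 of the projections <X, v>, <Y, v>."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n_dirs, x.shape[1]))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # uniform directions on the sphere
    return float(np.mean([e2_1d(x @ w, y @ w, gamma) for w in v]))
```

The sliced quantity vanishes exactly on identical samples and stays bounded away from zero on well-separated ones, mirroring the behavior of $E_\gamma$ itself.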
Given $0 < s < d$, the Riesz potential $I_s f$ of a compactly supported signed measure $f$ on $\mathbb{R}^d$ is defined (in a weak sense) by $I_s f = f * K_s$, where $K_s(x) = c_s^{-1} \|x\|^{s-d}$ and $c_s = \pi^{d/2}\, 2^s\, \Gamma(s/2)\, \Gamma((d-s)/2)^{-1}$. The Fourier transform of the Riesz kernel is given by $\hat{K}_s(\omega) = \|\omega\|^{-s}$, interpreted as a tempered distribution. The following proposition is derived by setting $s = \frac{d+\gamma}{2}$ and using the Fourier form (Proposition 5) of the energy distance.

Proposition 9. Let $\gamma \in (0, \min\{d, 2\})$ and let $\mu, \nu$ be compactly supported probability distributions on $\mathbb{R}^d$. Then
$$E_\gamma(\mu, \nu) = (2\pi)^{d/2} \sqrt{F_\gamma(d)}\, \big\| I_{\frac{d+\gamma}{2}}(\mu - \nu) \big\|_2. \qquad (19)$$

3. Main Comparison: TV Versus Energy

After considering the connection of the perceptron discrepancy $d_H$ to the energy distance in Section 2, we turn to some of our main technical results, which provide novel quantitative comparisons between $\{E_\gamma\}_{\gamma \in (0,2)}$ and the total variation distance. In Section 3.1 we show that the generalized energy distance is upper bounded by total variation for compactly supported distributions. In Section 3.2 we derive lower bounds on the generalized energy distance in terms of the total variation distance over the two distribution classes that we have introduced, namely smooth distributions and Gaussian mixtures. Finally, in Section 3.3 we turn to the case of discrete distributions, which requires alternative techniques.

3.1 Upper Bound: Arbitrary Compactly Supported Distributions

Note that we (obviously) always have $d_H(\mu, \nu) \leq \mathrm{TV}(\mu, \nu)$ for arbitrary probability measures $\mu$ and $\nu$. Moreover, for distributions supported on a unit ball we also have $\bar{d}_H(\mu, \nu) \lesssim d_H(\mu, \nu)$. Therefore, by the identification of $\bar{d}_H$ and $E_1$ (Theorem 4), we can see that for distributions with bounded support we always have $E_1(\mu, \nu) \lesssim \mathrm{TV}(\mu, \nu)$. The next result generalizes this estimate to all $E_\gamma$, not just $\gamma = 1$.

Proposition 10.
For any dimension $d \geq 1$ and $\gamma \in (0, 2)$ there exists a finite constant $c = c(d, \gamma)$ such that for any two probability distributions $\mu, \nu$ supported on the unit ball we have $E_\gamma(\mu, \nu) \leq c\, \mathrm{TV}(\mu, \nu)$.

Proof The proof directly follows from $\|x - y\|^\gamma \leq 2^\gamma$ and so
$$E_\gamma^2(\mu, \nu) = \mathbb{E}\big[ 2\|X - Y\|^\gamma - \|X - X'\|^\gamma - \|Y - Y'\|^\gamma \big] = -\int \|x - y\|^\gamma\, (d\mu(x) - d\nu(x))(d\mu(y) - d\nu(y)) \leq 2^\gamma\, \mathrm{TV}(\mu, \nu)^2,$$
where $(X, X', Y, Y') \sim \mu^{\otimes 2} \otimes \nu^{\otimes 2}$.

3.2 Lower Bound: Smooth Distributions and Gaussian Mixtures

In Section 3.1 we showed that the energy distance is upper bounded by total variation for compactly supported measures. In this section we look at the reverse direction, namely, we aim to lower bound the energy distance by total variation.

Theorem 11. For any $\beta > 0$ and $d \geq 1$, there exists a finite constant $C_1 = C_1(\beta, d)$ so that
$$\sqrt{\gamma(2-\gamma)}\; \frac{\mathrm{TV}(\mu, \nu)^{\frac{2\beta+d+\gamma}{2\beta}}}{C_1\, (\|\mu\|_{\beta,2}^2 + \|\nu\|_{\beta,2}^2)^{\frac{d+\gamma}{4\beta}}} \leq E_\gamma(\mu, \nu) \qquad (20)$$
for any $\mu, \nu \in \mathcal{P}_S(\beta, d, \infty)$ and $\gamma \in (0, 2)$. Similarly, for any $d \geq 1$ there exists a finite constant $C_2 = C_2(d)$ such that
$$\sqrt{\gamma(2-\gamma)}\; \frac{\mathrm{TV}(\mu, \nu)}{\log(3 + 1/\mathrm{TV}(\mu, \nu))^{\frac{2d+\gamma}{4}}} \leq C_2\, E_\gamma(\mu, \nu) \qquad (21)$$
for every $\mu, \nu \in \mathcal{P}_G(d)$ and $\gamma \in (0, 2)$.

Proof Abusing notation, identify $\mu$ and $\nu$ with their Lebesgue densities. The argument proceeds through a chain of inequalities:

1. Bound TV by the $L^2$ distance between densities.
2. Apply Parseval's theorem to pass to Fourier space.
3. Apply Hölder's inequality with well-chosen exponents.

Proof of (20). Jensen's inequality implies that
$$2\,\mathrm{TV}(\mu, \nu) = \|\mu - \nu\|_1 \leq \sqrt{\mathrm{vol}(B(0, 1))}\, \|\mu - \nu\|_2 \lesssim \|\mu - \nu\|_2,$$
where vol denotes volume and we discard dimension-dependent constants. This completes the first step of our proof. For the second step, note that $\mu, \nu \in L^2(\mathbb{R}^d)$ and we may apply Parseval's theorem to obtain $\|\mu - \nu\|_2^2 = \frac{1}{(2\pi)^d} \|\hat{\mu} - \hat{\nu}\|_2^2$. For arbitrary $\varphi > 0$ and $r \in [1, \infty]$, Hölder's inequality with exponents $\frac{1}{r} + \frac{1}{r'} = 1$ implies that
$$\|\hat{\mu} - \hat{\nu}\|_2^2 = \int_{\mathbb{R}^d} |\hat{\mu}(\omega) - \hat{\nu}(\omega)|^2\, d\omega \leq \left( \int_{\mathbb{R}^d} |\hat{\mu}(\omega) - \hat{\nu}(\omega)|^2\, \|\omega\|^{\varphi r'}\, d\omega \right)^{1/r'} \left( \int_{\mathbb{R}^d} \frac{|\hat{\mu}(\omega) - \hat{\nu}(\omega)|^2}{\|\omega\|^{\varphi r}}\, d\omega \right)^{1/r}. \qquad (22)$$
Now, we choose φ and r to satisfy φr′ = 2β and φr = d + γ.
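The exponent bookkeeping in this step can be double-checked mechanically. The snippet below (our own sanity check, not the paper's) verifies that $r = \frac{2\beta+d+\gamma}{2\beta}$ and $\varphi = \frac{2\beta(d+\gamma)}{2\beta+d+\gamma}$, the solution quoted in the proof, satisfy both constraints together with the Hölder conjugacy relation:

```python
def holder_exponents(beta, d, gamma):
    """Candidate solution of phi * r' = 2*beta and phi * r = d + gamma,
    where 1/r + 1/r' = 1 (Hoelder conjugates)."""
    r = (2 * beta + d + gamma) / (2 * beta)
    phi = 2 * beta * (d + gamma) / (2 * beta + d + gamma)
    r_conj = r / (r - 1)  # conjugate exponent: 1/r + 1/r_conj = 1
    return r, r_conj, phi
```

Plugging in a few parameter choices confirms that both constraints hold exactly and that $r \geq 1$, $\varphi > 0$, as asserted in the proof.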
The first equation ensures that the first integral term is bounded by µ ν 2/r β,2, which is assumed to be at most a d, β dependent constant. The second equation ensures that the second integral term is GERBER, JIANG, POLYANSKIY, AND SUN equal to (Eγ(µ, ν)2/Fγ(d))1/r by Proposition 5. The solution to this system of equations is given by r = (2β + d + γ)/(2β) and ϕ = 2β d+γ 2β+d+γ . Note that clearly ϕ > 0 and r 1. Thus, after rearrangement and using that Fd(γ) = Θ(γ(2 γ)) up to a dimension dependent constant, we obtain γ(2 γ) bµ bν 2β 2 C1 ( µ 2 β,2 + ν 2 β,2) 4β Eγ(µ, ν), for a finite constant C1 = C1(d, β), concluding the proof. Proof of (21). We write C(d) (0, ) for a dimension dependent constant that may change from line to line. The outline of the argument is analogous to the above, with the additional step of having to bound the (β, 2)-Sobolev norm of the Gaussian density as β for which we rely on Lemma 30. Let µ and ν have densities p φ and q φ, where φ is the density of N(0, Id). Writing f = (p q), we can extend the proof of (Jia et al., 2023, Theorem 22) to multiple dimensions to find, for any R > 2, that 2TV(µ, ν) = µ ν 1 = Z x R |(f φ)(x)| dx + Z Rd φ(x y)df(y) dx vold(B(0, R)) x R |(f φ)(x)|2dx + Z x >R exp( x 2/8)dx C(d) Rd/2 µ ν 2 + exp( Ω(R2)) , where the second line uses that supp(f) B(0, 1). Taking R p log(3 + 1/ µ ν 2) we obtain the inequality TV(µ, ν) C(d) µ ν 2 log(3 + 1/ µ ν 2)d/4. (23) By H older s inequality we obtain bf 2 ω β bf(ω) d+γ 2β+d+γ 2 = ω β bf(ω) d+γ 2β+d+γ 2 Eγ(µ, ν) 2β 2β+d+γ Fγ(d) β 2β+d+γ by Proposition 5. Using that | bf| |bφ| and applying Lemma 30, for β 1 we get β 2β+d+γ bf 2 Eγ(µ, ν) 2β 2β+d+γ 5πd/2 2 ! d+γ 2(2β+d+γ) . Rearranging and using Parseval s Theorem, we get Eγ(µ, ν) C(d) p γ(2 γ) f 2 f 2e (d+γ)(2β+d 1) DENSITY ESTIMATION USING THE PERCEPTRON for some d-dependent, albeit exponential, constant C(d) > 0. 
Plugging in $\beta = \log(3 + 1/\|f\|_2)$ and assuming that $\|f\|_2$ is small enough in terms of $d$, we obtain
$$E_\gamma(\mu, \nu) \geq C(d)\, \sqrt{\gamma(2-\gamma)}\; \frac{\|f\|_2}{\log(3 + 1/\|f\|_2)^{\frac{d+\gamma}{4}}} \geq C(d)\, \sqrt{\gamma(2-\gamma)}\; \frac{\mathrm{TV}(\mu, \nu)}{\log(3 + 1/\mathrm{TV}(\mu, \nu))^{\frac{2d+\gamma}{4}}},$$
where the second inequality uses (23) and Lemma 26.

Theorem 11 is our main technical result, which shows that $E_\gamma$ is lower bounded by a polynomial of the total variation distance for both the smooth distribution class $\mathcal{P}_S$ and Gaussian mixtures $\mathcal{P}_G$. Note also that in one dimension, (20) follows from the Gagliardo-Nirenberg-Sobolev interpolation inequality. However, to our knowledge, the inequality is new for $d > 1$. As for the tightness of Theorem 11, we manage to prove that this inequality is the best possible for $\mathcal{P}_S$ in one dimension, and best possible up to a poly-logarithmic factor in dimension 2 and above.

Proposition 12. For any $\beta > 0$, $d \geq 1$, $\gamma \in (0, 2)$ and $C > 0$ satisfying Assumption 1, there exists a finite constant $C_1 = C_1(\beta, d, \gamma, C)$ so that for any value of $\epsilon \in (0, 1)$, there exist $\mu_\epsilon, \nu_\epsilon \in \mathcal{P}_S(\beta, d, C)$ such that $\mathrm{TV}(\mu_\epsilon, \nu_\epsilon)/\epsilon \in (1/C_1, C_1)$ and
$$E_\gamma(\mu_\epsilon, \nu_\epsilon) \leq C_1\, \mathrm{TV}(\mu_\epsilon, \nu_\epsilon)^{\frac{2\beta+d+\gamma}{2\beta}}\, \log\Big(3 + \frac{1}{\mathrm{TV}(\mu_\epsilon, \nu_\epsilon)}\Big).$$

In the special case $\gamma = 1$ we obtain an even stronger notion of tightness.

Proposition 13. When $\gamma = 1$ we may replace $E_1$ by $d_H$ in Proposition 12.

Proposition 13 is an improvement over Proposition 12 due to the inequality $\bar{d}_H \lesssim d_H$ over the class $\mathcal{P}_S(\beta, d, C)$, which follows from Proposition 3. It shows also that our construction has the property that there does not exist any halfspace that separates $\mu$ and $\nu$ better than our bounds suggest. The proofs of both results are presented in Section E. The general idea is to saturate Hölder's inequality in (22), for which the Fourier transform of $f = \frac{d(\mu - \nu)}{dx}$ should be supported on a sphere. However, such an $f$ clearly cannot be compactly supported. Thus the actual construction is to multiply the Fourier inverse of the uniform measure on a sphere with a compactly supported mollifier.
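As a small numerical companion to these TV-versus-energy comparisons (our own illustration, not from the paper): in one dimension $E_1^2(\mu, \nu) = 2\int (F_\mu - F_\nu)^2\, dx$, so for two unit-variance Gaussians both $E_1$ and TV are easy to evaluate side by side.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def energy_sq_1d_gauss(shift, lo=-10.0, hi=11.0, n=200_000):
    """E_1^2 between N(0,1) and N(shift,1) via E_1^2 = 2 * int (F - G)^2 dx
    (midpoint rule)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        diff = normal_cdf(x) - normal_cdf(x, mu=shift)
        total += diff * diff * h
    return 2.0 * total

def tv_gauss(shift):
    # TV(N(0,1), N(s,1)) = 2*Phi(|s|/2) - 1
    return 2.0 * normal_cdf(abs(shift) / 2.0) - 1.0
```

For a unit shift this gives $E_1^2 \approx 0.542$ against $\mathrm{TV} \approx 0.383$, consistent with $E_1 \lesssim \mathrm{TV}$ for compactly bounded discrepancies of this size.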
In $d > 1$ the mollifier that we require must have super-polynomial Fourier spectrum decay, for which we use the recent construction in Cohen (2023).

3.3 Lower Bound: Discrete Distributions

Suppose we have two discrete distributions that are supported on a common, finite set of size $k$. One way to measure the energy distance between them would be to identify their support with the set $\{1, 2, \ldots, k\}$, thereby embedding the two distributions in $\mathbb{R}$, and applying the one-dimensional energy distance. While the above approach seems reasonable, it is entirely arbitrary. Indeed, there might not be a natural ordering of the support; moreover, why should one choose the integers between 1 and $k$ instead of, say, the set $\{1, 2, 4, \ldots, 2^k\}$? The total variation distance does not suffer from such ambiguities, and it is unclear how our choice of embedding affects the relationship to TV. The following result attacks precisely this question.

Theorem 14. Let $\mu$ and $\nu$ be probability distributions supported on the set $\{x_1, \ldots, x_k\} \subseteq \mathbb{R}^d$ and let $\delta = \min_{i \neq j} \|x_i - x_j\|$. Then there exists a universal constant $C > 0$ such that
$$E_1^2(\mu, \nu) \geq \frac{C\delta}{\sqrt{d}\, k}\, \mathrm{TV}^2(\mu, \nu).$$

Proof Let $\mu = \sum_{i=1}^k \mu_i \delta_{x_i}$ and $\nu = \sum_{i=1}^k \nu_i \delta_{x_i}$. Then, by (Ball, 1992, Theorem 1) we have
$$E_1^2(\mu, \nu) = -\sum_{i,j} (\mu_i - \nu_i)(\mu_j - \nu_j)\, \|x_i - x_j\| \geq \frac{C\delta}{\sqrt{d}} \sum_{i=1}^k (\mu_i - \nu_i)^2 \geq \frac{C\delta}{\sqrt{d}\, k}\, \mathrm{TV}^2(\mu, \nu).$$

Remark 15. Similar results can be proved for the generalized energy distance $E_\gamma$, using e.g. the work of Narcowich and Ward (1992). However, to the best of our knowledge, these estimates degrade significantly in the dimension $d$, in contrast with Ball (1992).

Notice that, by our discussion above, the support set $\{x_1, \ldots, x_k\}$ in Theorem 14 is arbitrary and may be chosen by us. Since the scale of the supporting points $x_1, \ldots, x_k$ is statistically irrelevant, we remove this ambiguity by restricting the points to lie in the unit ball, i.e. requiring that $\max_i \|x_i\| \leq 1$. We see now that the comparison between $E_1$ and TV improves as $\delta/\sqrt{d}$ grows.
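The quadratic form behind Theorem 14's proof, $E_1^2(\mu, \nu) = -\sum_{i,j}(\mu_i - \nu_i)(\mu_j - \nu_j)\|x_i - x_j\|$, is easy to evaluate directly. Below is our own sketch; for the one-hot embedding all support points are $\sqrt{2}$ apart, so the form collapses to the closed form $E_1^2 = \sqrt{2}\,\|\mu - \nu\|_2^2$.

```python
import numpy as np

def discrete_energy_sq(points, mu, nu, gamma=1.0):
    """E_gamma^2 between two pmfs mu, nu on common embedded support points:
    E_gamma^2 = - sum_{i,j} t_i t_j ||x_i - x_j||^gamma with t = mu - nu."""
    t = np.asarray(mu) - np.asarray(nu)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1) ** gamma
    return float(-t @ dist @ t)

# one-hot embedding of an alphabet of size k: all pairwise distances equal sqrt(2)
k = 4
points = np.eye(k)
mu = np.array([0.7, 0.1, 0.1, 0.1])
nu = np.array([0.1, 0.1, 0.1, 0.7])
e2 = discrete_energy_sq(points, mu, nu)  # equals sqrt(2) * ||mu - nu||_2^2
tv = 0.5 * np.abs(mu - nu).sum()
```

Since all one-hot points are equidistant, the off-diagonal sum telescopes against $(\sum_i t_i)^2 = 0$, which is what produces the $\sqrt{2}\,\|t\|_2^2$ identity used in the check below.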
Given a fixed value of $\delta$, we want to make the dimension $d$ of our embedding as low as possible, which means that the points $x_1, \ldots, x_k$ should form a large $\delta$-packing of the $d$-dimensional unit ball. Due to well-known bounds on the packing number of the Euclidean ball, it follows that the best one can hope for is $\log(k) \asymp d \log(1/\delta)$. Maximizing $\delta/\sqrt{d}$ subject to this constraint yields the choice $d = \Theta(\log(k))$ and $\delta = \Theta(1)$. This gives us the following corollary.

Corollary 16. There exists a universal constant $C \in (0, \infty)$ such that for any $k \geq 1$ there exists a set of points $x_1, \ldots, x_k \in \mathbb{R}^{C\log(k)}$ with $\max_i \|x_i\| \leq 1$ such that for any two probability mass functions $\mu = (\mu_1, \ldots, \mu_k)$ and $\nu = (\nu_1, \ldots, \nu_k)$.

The question arises how the set of points $x_1, \ldots, x_k$ in Corollary 16 should be constructed. One solution is to use an error-correcting code (ECC), whereby we take the $x_i$ to be the codewords of an ECC on the scaled hypercube $\frac{1}{\sqrt{d}}\{\pm 1\}^d$ for some dimension $d$ (known as blocklength in this context). An ECC is asymptotically good if the message length $\log(k)$ is linear in the blocklength $d$, that is $d \asymp \log(k)$, and if the minimum Hamming distance between any two codewords is $\Theta(d)$, which translates precisely into $\delta = \min_{i \neq j} \|x_i - x_j\| \asymp 1$. Many explicit constructions of asymptotically good error-correcting codes exist, see Justesen (1972) for one such example, and random codes are almost surely good (Barg and Forney, 2002). Clearly, the better the code is, the better the constants we obtain in Corollary 16.

Remark 17. One interesting consequence of Corollary 16 and the preceding discussion is the following: given a categorical feature with $k$ possible values, the perceptron may obtain better performance by identifying each category with the codewords $x_1, \ldots, x_k$ of an ECC instead of the standard one-hot encoding.

4.
Density Estimation

In this section we apply what we have learned about the generalized energy distance and the perceptron discrepancy in prior sections, and analyze multiple problems related to density estimation.

4.1 Estimating Smooth Distributions and Gaussian Mixtures

Suppose that $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \nu$ for some probability distribution $\nu$ on $\mathbb{R}^d$. Given a class of generator distributions $G$ and $\gamma \in (0, 2)$, define the minimum-$E_\gamma$ estimator as
$$\hat{\nu}_\gamma \in \arg\min_{\nu' \in G} E_\gamma(\nu', \nu_n), \qquad (25)$$
where $\nu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$. Note that $\hat{\nu}_\gamma$ does not quite agree with our definition of $\hat{\nu}$ in (2), because the $\gamma = 1$ case minimizes the average halfspace distance $\bar{d}_H \asymp E_1$ and not the perceptron discrepancy $d_H$. The following two results bound the performance of $\hat{\nu}_\gamma$ as defined in (25), as an estimator of $\nu$, for the smooth density class $\mathcal{P}_S$ as well as the Gaussian mixture class $\mathcal{P}_G$. In Section 4.2 we present the adaptation of these results to $d_H$, thereby proving Theorem 1.

Theorem 18. Let $\hat{\nu}_\gamma$ be the estimator defined in (25). For any $\beta > 0$, $d \geq 1$ and $C > 0$, there exists a finite constant $C_1 = C_1(\beta, d, C)$ so that
$$\sup_{\nu \in \mathcal{P}_S(\beta, d, C)} \mathbb{E}\,\mathrm{TV}(\hat{\nu}_\gamma, \nu) \leq C_1\, (n\gamma(2-\gamma))^{-\frac{\beta}{2\beta+d+\gamma}} \qquad (26)$$
holds for $G = \mathcal{P}_S(\beta, d, C)$ and any $\gamma \in (0, 2)$. Similarly, for any $d \geq 1$ there is a finite constant $C_2 = C_2(d)$ such that
$$\sup_{\nu \in \mathcal{P}_G(d)} \mathbb{E}\,\mathrm{TV}(\hat{\nu}_\gamma, \nu) \leq C_2\, \varphi\big( (n\gamma(2-\gamma))^{-1/2} \big) \qquad (27)$$
holds for $G = \mathcal{P}_G(d)$ and any $\gamma \in (0, 2)$, where $\varphi(x) = x \log(3 + 1/x)^{\frac{2d+\gamma}{4}}$.

Proof Let us focus on the case $G = \mathcal{P}_S(\beta, d, C)$ first and let $t = \frac{2\beta+d+\gamma}{2\beta}$. The inequality $E_\gamma(\hat{\nu}_\gamma, \nu_n) \leq E_\gamma(\nu, \nu_n)$ holds almost surely by the definition of $\hat{\nu}_\gamma$. Writing $C_1 = C_1(\beta, d, C)$ for a finite constant that we relabel freely, the first claim is substantiated by the chain of inequalities
$$\mathbb{E}\,\mathrm{TV}(\hat{\nu}_\gamma, \nu) \overset{\text{Thm. 11}}{\leq} \mathbb{E}\Big[ C_1 \Big( \tfrac{E_\gamma(\hat{\nu}_\gamma, \nu)}{\sqrt{\gamma(2-\gamma)}} \Big)^{1/t} \Big] \leq \mathbb{E}\Big[ C_1 \Big( \tfrac{E_\gamma(\nu, \nu_n) + E_\gamma(\hat{\nu}_\gamma, \nu_n)}{\sqrt{\gamma(2-\gamma)}} \Big)^{1/t} \Big] \overset{\text{Eq. (25)}}{\leq} \mathbb{E}\Big[ C_1 \Big( \tfrac{2 E_\gamma(\nu, \nu_n)}{\sqrt{\gamma(2-\gamma)}} \Big)^{1/t} \Big] \overset{\text{Jensen's}}{\leq} C_1 \Big( \tfrac{2\,\mathbb{E} E_\gamma(\nu, \nu_n)}{\sqrt{\gamma(2-\gamma)}} \Big)^{1/t} \overset{\text{Lem. 7}}{\leq} C_1\, (n\gamma(2-\gamma))^{-\frac{\beta}{2\beta+d+\gamma}}.$$
The result for $G = \mathcal{P}_G$ follows analogously. Define $r(x) = x \log(3 + 1/x)^{-t}$ where $t = \frac{2d+\gamma}{4} > 0$. By direct calculation, one can check that r is strictly increasing and convex on R+.
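The claim that $r(x) = x \log(3 + 1/x)^{-t}$ is increasing and convex is indeed a direct calculation; here is a quick finite-difference check (our own, for the representative choice $d = 1$, $\gamma = 1$, i.e. $t = 3/4$):

```python
import math

def r(x, t=0.75):
    # r(x) = x * log(3 + 1/x)^(-t), the rate function used for the P_G case
    return x * math.log(3.0 + 1.0 / x) ** (-t)

# sample r on a grid; positive first differences certify monotonicity,
# positive second differences certify (discrete) convexity
h = 0.005
xs = [0.05 + i * h for i in range(400)]  # grid on [0.05, 2.05]
vals = [r(x) for x in xs]
first = [b - a for a, b in zip(vals, vals[1:])]
second = [c - 2 * b + a for a, b, c in zip(vals, vals[1:], vals[2:])]
```

This is only a spot check on a finite grid, of course, not a proof; the paper's claim covers all of $\mathbb{R}_+$.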
As a consequence, its inverse $r^{-1}$ is strictly increasing and concave. Let $C_2$ be a $d$-dependent finite constant which we relabel repeatedly. Similarly to the case of smooth distributions covered above, we obtain the chain of inequalities
$$\mathbb{E}\,\mathrm{TV}(\hat{\nu}_\gamma, \nu) \overset{\text{Thm. 11}}{\leq} \mathbb{E}\Big[ r^{-1}\Big( \tfrac{C_2\, E_\gamma(\hat{\nu}_\gamma, \nu)}{\sqrt{\gamma(2-\gamma)}} \Big) \Big] \overset{\text{Jensen's}}{\leq} r^{-1}\Big( \tfrac{C_2\, \mathbb{E} E_\gamma(\hat{\nu}_\gamma, \nu)}{\sqrt{\gamma(2-\gamma)}} \Big) \overset{\text{Eq. (25)}}{\leq} r^{-1}\Big( \tfrac{2C_2\, \mathbb{E} E_\gamma(\nu, \nu_n)}{\sqrt{\gamma(2-\gamma)}} \Big) \overset{\text{Lem. 7}}{\leq} r^{-1}\big( C_2\, (n\gamma(2-\gamma))^{-1/2} \big).$$
The conclusion follows by Lemma 26.

Notice that the rate of estimation of the minimum-$E_\gamma$ density estimator improves as $\gamma \to 0$, and in fact seems to approach the optimum. However, simultaneously, the effective sample size $n\gamma$ shrinks. The best trade-off that we can derive by adaptively setting $\gamma$ to be dependent on $n$ is the following.

Corollary 19. For any $\beta > 0$, $d \geq 1$ and $C > 0$, there exists a finite constant $C_1 = C_1(\beta, d, C)$ such that: for all $n \geq 2$, with $\hat{\nu}_{\gamma_n} \in \arg\min_{\nu' \in G} E_{\gamma_n}(\nu', \nu_n)$ in the setting of (25) where $\gamma_n = \log(n)^{-1}$ and $G = \mathcal{P}_S(\beta, d, C)$, one has:
$$\sup_{\nu \in \mathcal{P}_S(\beta, d, C)} \mathbb{E}\,\mathrm{TV}(\hat{\nu}_{\gamma_n}, \nu) \leq C_1 \Big( \frac{\log n}{n} \Big)^{\frac{\beta}{2\beta+d}}. \qquad (28)$$
Similarly, for any $d \geq 1$ there is a finite constant $C_2 = C_2(d)$ such that for all $n \geq 3$, with $\hat{\nu}_{\gamma_n} \in \arg\min_{\nu' \in G} E_{\gamma_n}(\nu', \nu_n)$ where $\gamma_n = \log\log(2n)^{-1}$ and $G = \mathcal{P}_G(d)$, one has:
$$\sup_{\nu \in \mathcal{P}_G(d)} \mathbb{E}\,\mathrm{TV}(\hat{\nu}_{\gamma_n}, \nu) \leq C_2\, \log(n)^{d/2} \sqrt{\log\log n}\, /\sqrt{n}. \qquad (29)$$

4.2 Proof of Theorem 1 and Theorem 2

We already have everything needed to deduce Theorem 2. Since it is an exercise in combining results, we simply list the required steps:

1. Use Theorem 11 to get a comparison between TV and $E_\gamma$.
2. Set $\gamma = 1$ and use Theorem 4 to get the equivalence between $E_1$ and $\bar{d}_H$.
3. Use Proposition 3 to get a comparison between $\bar{d}_H$ and $d_H$.

Turning to the proof of Theorem 1, we find that it is completely analogous to the proof of Theorem 18, with the only difference being that we can no longer rely on Lemma 7 to show that the distance between empirical and population measures decays at the parametric rate, as the latter applies to $E_\gamma$ instead of $d_H$. However, the corresponding result for $d_H$ is well known.

Lemma 20.
Let $\nu$ be a probability distribution on $\mathbb{R}^d$ and $\nu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ for i.i.d. observations $X_i \sim \nu$. Then, for a finite universal constant $C$,
$$\mathbb{E}\, d_H(\nu, \nu_n) \leq C\sqrt{\frac{d}{n}}.$$

Proof Follows for example from (Vershynin, 2018, 8.3.23) and the fact that $\mathcal{D}_1$, the family of halfspace indicators, has VC dimension $d + 1$.

With Lemma 20 in hand, completing the argument is straightforward: to deduce Theorem 1, follow the same steps as in the proof of Theorem 18, except use Theorem 2 and Lemma 20 in place of Theorem 11 and Lemma 7, respectively.

4.3 Estimating Discrete Distributions

In many practical machine learning tasks the data is discrete, albeit on a large alphabet $[k] = \{1, 2, \ldots, k\}$: for example, in recommender systems the alphabet could be all possible ads, products or articles. A common idea to apply modern learning pipelines to such data is to use an embedding $E : [k] \to \mathbb{R}^d$, with one-hot encoding ($d = k$) being the most popular choice. After such an embedding, the data is effectively made continuous and the density estimation methods as discussed previously can be applied. Can such an approach be good in the sense of minimax estimation guarantees? We answer this question positively in this section, provided that the embedding $E$ comes from an error-correcting code.

Let $P_k$ denote the set of all probability distributions on the set $[k]$. Suppose we observe an i.i.d. sample $X_1, \ldots, X_n \sim \nu$ from some unknown distribution $\nu \in P_k$. The problem of estimating $\nu$ is effectively trivial: the empirical distribution provides a minimax optimal estimator. Indeed, it is a folklore fact, see for example (Canonne, 2020, Theorem 1) or (Polyanskiy and Wu, 2023+, Exc. VI.8), that the optimal rate of estimation is given by
$$\inf_{\hat{\nu}} \sup_{\nu \in P_k} \mathbb{E}\,\mathrm{TV}^2(\hat{\nu}, \nu) \asymp \min\Big\{ \frac{k}{n}, 1 \Big\}. \qquad (30)$$
Recall from Section 3.3 that we may choose to embed the alphabet $[k]$ into some higher-dimensional Euclidean space. Given distinct points x1, . . .
, $x_k \in \mathbb{R}^d$ for some $d \geq 1$, we can identify any distribution $\mu \in P_k$ with the probability distribution $\sum_{i=1}^k \mu_i \delta_{x_i}$, where $\mu_i$ is the mass that $\mu$ puts on $i \in [k]$.

Theorem 21. There exists a universal constant $C < \infty$ with the following property. For any alphabet size $k$ there exist embedding points $a_1, \ldots, a_k \in \mathbb{R}^{C\log(k)}$ such that, given an i.i.d. sample $X_1, \ldots, X_n \sim \nu$ from an unknown $\nu \in P_k$, any estimator $\hat{\nu} \in P_k$ that satisfies
$$E_1\Big( \sum_{i=1}^k \hat{\nu}_i \delta_{a_i},\; \sum_{i=1}^k \nu_{n,i} \delta_{a_i} \Big) \leq \frac{c}{\sqrt{n}} \qquad (31)$$
for any $c$ enjoys the performance guarantee
$$\sup_{\nu \in P_k} \mathbb{E}\,\mathrm{TV}^2(\hat{\nu}, \nu) \leq C \min\Big\{ \frac{k\sqrt{\log(k)}\, (c+1)^2}{n},\; 1 \Big\}. \qquad (32)$$
Moreover, we may replace $E_1$ by $d_H$ in (31) and the result (32) remains true with $\sqrt{\log(k)}$ replaced by $\log(k)$.

Proof Let $a_1, \ldots, a_k \in \mathbb{R}^d$ be the points defined in Corollary 16 (relabeled from $x_1, \ldots, x_k$ for clarity) so that $d \asymp \log(k)$. By the triangle inequality and Lemma 7 (using $\max_i \|a_i\| \leq 1$) we have
$$\mathbb{E}\, E_1\Big( \sum_{i=1}^k \hat{\nu}_i \delta_{a_i},\; \sum_{i=1}^k \nu_i \delta_{a_i} \Big) \leq \frac{c + \sqrt{2\max_i \|a_i\|}}{\sqrt{n}} \leq \frac{c+2}{\sqrt{n}}.$$
By Corollary 16, the definition of $\hat{\nu}$ and the triangle inequality it follows that
$$\mathbb{E}\,\mathrm{TV}^2(\hat{\nu}, \nu) \lesssim \frac{k\sqrt{\log(k)}\, (c+1)^2}{n}.$$
Noting the trivial fact that $\mathrm{TV} \leq 1$ completes the proof of the first claim. Suppose now that we replace $E_1$ by $d_H$ in the definition of $\hat{\nu}$. The proof follows analogously, using the chain $\mathrm{TV}^2 \lesssim k\sqrt{\log(k)}\, E_1^2$ (Cor. 16), $E_1 \asymp \bar{d}_H$ (Thm. 4, with explicit constant $\pi^{(d-1)/4}\Gamma(\frac{d+1}{2})^{-1/2}$) and $\bar{d}_H \lesssim d_H$ (since $\max_i \|a_i\| \leq 1$), together with Lemma 20 in place of Lemma 7, which is where we lose the $\log(k)$ factor.

As we explained, the empirical distribution $\nu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ achieves optimality in (30), and clearly also achieves $c = 0$ in (31), i.e. minimizes the empirical risk globally. The point of Theorem 21 is to show that approximate minimizers, such as those found via SGD, are also nearly minimax optimal.

4.3.1 ESTIMATING HÖLDER SMOOTH DENSITIES

Theorem 21 has interesting implications for density estimation over the class of distributions on the cube $[0, 1]^d$ with uniformly bounded derivatives up to order $\lfloor\beta\rfloor \geq 1$ and $(\beta - \lfloor\beta\rfloor)$-Hölder continuous $\lfloor\beta\rfloor$-th derivative; call such distributions simply $\beta$-Hölder smooth.5 Writing $B_j$ for the cube with center $(j - \frac{1}{2})\epsilon^{1/\beta}$ and sidelength $\epsilon^{1/\beta}$ where j ∈ {1, …
, ε^{-1/β}}^d, it is known that
$$\sum_j \Big| \int_{B_j} (f(x) - g(x))\, dx \Big| \geq c \int_{[0,1]^d} |f(x) - g(x)|\, dx - O(\epsilon)$$
for any $\beta$-Hölder smooth densities $f, g$ and an $\epsilon$-independent constant $c > 0$, see for example (Arias-Castro et al., 2018, Lemma 7.2) or (Ingster and Suslina, 2003, Proposition 2.16). In other words, discretizing such distributions using a regular grid with $\Omega(\epsilon^{-d/\beta})$ cells maintains total-variation distances up to an additive $O(\epsilon)$ error. Now, consider a multilayer perceptron, that is, a fully connected multilayer neural network with activations given by $x \mapsto 1\{x \geq 0\}$. Such a multilayer network with large enough hidden layers can in principle implement the discretization described above, and embed the $\epsilon^{-d/\beta}$ cells as an error-correcting code. Thus, due to Theorem 21, the ERM density estimator (2) would achieve the minimax optimal density estimation rate $n^{-\beta/(2\beta+d)}$ over $\beta$-Hölder smooth densities, up to polylog factors, provided the discriminator class $\mathcal{D}$ includes the aforementioned multilayer perceptron and has VC dimension at most polylog in $1/\epsilon$. This observation essentially generalizes Theorem 1, which shows that if the discriminator class includes only the single-layer (with one neuron) perceptron then the best possible rate is $n^{-\beta/(2\beta+d+1)}$.

4.4 A Stopping Criterion for Smooth Density Estimation

As a corollary to our results, we propose a stopping criterion for training density estimators. Before doing so, let us record a result about the concentration properties of the empirical energy distance about its expectation.

Lemma 22. Let $\nu$ be supported on a compact subset $\Omega \subseteq \mathbb{R}^d$, and let $\nu_n$ be its empirical measure based on $n$ i.i.d. observations. For every $\gamma \in (0, 2)$ there exists a constant $C_1 = C_1(\Omega, \gamma) > 0$ such that
$$\mathbb{P}\Big( E_\gamma(\nu, \nu_n) \geq \frac{C_1}{\sqrt{n}} + t \Big) \leq 2\exp\Big( -\frac{nt^2}{C_1} \Big).$$
In other words, $E_\gamma(\nu, \nu_n)$ is $O(1/n)$-sub-Gaussian.

Proof Recall the MMD formulation of the generalized energy distance from Section 2.3. The corresponding kernel is given by $k_\gamma(x, y) = \|x\|^\gamma + \|y\|^\gamma - \|x - y\|^\gamma$. Clearly
$$\sup_{x, x', y, y' \in \mathrm{supp}(\nu)} \big( k_\gamma(x, y) - k_\gamma(x', y') \big) \lesssim \mathrm{diam}(\Omega)^\gamma.$$
5.
Note that this class is not the same as $\mathcal{P}_S$, although related.

Therefore, by McDiarmid's inequality we know that $E_\gamma(\nu, \nu_n)$ is $O(1/n)$-sub-Gaussian (note we do not track constants depending on $\Omega$ here). From Lemma 7 we know that $\mathbb{E}\, E_\gamma(\nu, \nu_n) \lesssim 1/\sqrt{n}$, and the conclusion follows.

Consider the following scenario: we have i.i.d. training data $X_1, \ldots, X_n$ from some distribution $\nu$ and we are training an arbitrary generative model to estimate $\nu$. Suppose that this training process gives us a sequence of density estimators $\{\mu_k\}_{k \geq 1}$, which could be the result of, say, subsequent gradient descent steps on our parametric class of generators. Is there any way to figure out after how many steps $K$ we may stop the training process? In other words, can we identify a value of $K$ such that $\mathrm{TV}(\nu, \mu_K)$ is guaranteed to be less than some threshold with probability $1 - \delta$? Note that an additional difficulty here is that our generative model for $\mu_k$ is able to generate samples from $\mu_k$ but otherwise gives us no other access to $\mu_k$.

The fast (dimension-free) concentration properties of $E_\gamma$ and the minimax optimality guarantees of its minimizer (whenever $\nu$ is smooth) make it an excellent choice for such a stopping criterion. Let $\nu \in \mathcal{P}_S(\beta, d, C)$ and let $\nu_n$ be its empirical version based on the $n$ i.i.d. observations. Assume further that $\{\mu_k\}_{k \geq 1} \subseteq \mathcal{P}_S(\beta, d, C)$ is a sequence of density estimators based on the sample $X_1, \ldots, X_n$. Finally, given the training sample $(X_1, \ldots, X_n)$, for each $k$ let $\mu_{k,m_k} = \frac{1}{m_k}\sum_{i=1}^{m_k} \delta_{X_i^{(k)}}$ be the empirical distribution of the sample $(X_1^{(k)}, \ldots, X_{m_k}^{(k)}) \sim \mu_k^{\otimes m_k}$.

Proposition 23. For any $\beta > 0$, $d \geq 1$ and $\gamma \in (0, 2)$ there exists a constant $c = c(\beta, d, \gamma)$ such that, with probability at least $1 - \delta$, simultaneously for all $k \geq 1$,
$$\mathrm{TV}(\mu_k, \nu) \leq c\Big( \sqrt{\frac{\log(1/\delta)}{n}} + E_\gamma(\mu_{k,m_k}, \nu_n) \Big)^{\frac{2\beta}{2\beta+d+\gamma}},$$
provided we take $m_k = c\, n \log(k^2/\delta)/\log(1/\delta)$.

Proof Let $c = C_1$ where $C_1$ is as in Lemma 22, and fix $\delta \in (0, 1)$. Define the event
$$A = \Big\{ E_\gamma(\nu, \nu_n) \leq \frac{c}{\sqrt{n}} + \sqrt{\frac{c\log(1/\delta)}{n}} \Big\}$$
and similarly
$$A_k = \Big\{ E_\gamma(\mu_{k,m_k}, \mu_k) \leq \frac{c}{\sqrt{m_k}} + \sqrt{\frac{c\, t_k}{m_k}} \Big\}$$
for some sequence t1, t2, …
, and each $k \geq 1$. By Lemma 22, $\mathbb{P}(A_k^c) = \mathbb{E}\,\mathbb{P}(A_k^c \mid X_1, \ldots, X_n) \leq 2\exp(-t_k)$. Taking $t_k = \log(k^2\pi^2/(3\delta))$, the union bound gives that $A$ and every $A_k$ hold simultaneously with probability at least $1 - 3\delta$ (rescaling $\delta$ by a constant factor yields the statement). By the inequality $E_\gamma(\mu_k, \nu) \leq E_\gamma(\mu_k, \mu_{k,m_k}) + E_\gamma(\mu_{k,m_k}, \nu_n) + E_\gamma(\nu_n, \nu)$, it follows that on this event, for every $k \geq 1$,
$$E_\gamma(\nu, \mu_k) \leq E_\gamma(\mu_{k,m_k}, \nu_n) + \frac{c}{\sqrt{n}} + \frac{c}{\sqrt{m_k}} + \sqrt{\frac{c\log(1/\delta)}{n}} + \sqrt{\frac{c\log(k^2\pi^2/(3\delta))}{m_k}}.$$
Thus, by choosing $m_k \asymp n\log(k^2/\delta)/\log(1/\delta)$ we can conclude that there exists a constant $c$ depending only on $\beta, d, \gamma$ such that
$$E_\gamma(\nu, \mu_k) \leq c\sqrt{\frac{\log(1/\delta)}{n}} + E_\gamma(\mu_{k,m_k}, \nu_n), \qquad k \geq 1.$$
The final conclusion follows from Theorem 11.

Note that our bound on the probability holds for all $k$ simultaneously, which is made possible by the fact that $m_k$ grows as $k \to \infty$. The empirical relevance of such a result is immediate: suppose we have proposed candidate generative models $\mu_1, \mu_2, \ldots$ (e.g. one after each period of training epochs, or from different training runs) trained on an i.i.d. dataset $X_1, \ldots, X_n$ of size $n$ from $\nu \in \mathcal{P}_S(\beta, d, C)$. A verifier only needs to request $m_k$ independent draws from the $k$-th candidate, and if we ever achieve $E_{(\log n)^{-1}}(\mu_{k,m_k}, \nu_n) \lesssim \sqrt{\log(1/\delta)/n}$ we can stop training and claim by Theorem 18 that we are a constant factor away from (near-)minimax optimality with probability $1 - \delta$.

5. Suboptimality for Two-Sample Testing

So far in this paper we have shown how the empirical energy distance minimizer, while being mismatched with the target total variation loss, nevertheless achieves nearly minimax optimal performance for density estimation tasks. Unfortunately, this surprising effect does not carry over to other statistical tasks, such as two-sample testing, which we describe in this section.

The task of two-sample testing over a family of distributions $\mathcal{P}$ is the following. Given two samples $(X, Y) \sim p^{\otimes n} \otimes q^{\otimes m}$ with unknown distribution, we need to distinguish between the hypotheses
$$H_0 : p = q,\ p \in \mathcal{P} \qquad \text{versus} \qquad H_1 : \mathrm{TV}(p, q) > \epsilon,\ p, q \in \mathcal{P}$$
with vanishing type-I and type-II error.
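A testing procedure in this spirit, built on the halfspace statistics studied in this paper, can be sketched as follows. This is our own illustration: random directions plus a scan over data-driven thresholds approximate the supremum over halfspaces, and the threshold choice $c \cdot n^{-1/3}$ is just one sequence satisfying $1/\sqrt{n} \ll t_n \ll 1$.

```python
import numpy as np

def halfspace_stat(x, y, n_dirs=200, seed=0):
    """Largest difference of empirical masses of a halfspace {z: <z,w> >= b},
    maximized over random directions w and all thresholds b."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_dirs):
        w = rng.standard_normal(x.shape[1])
        w /= np.linalg.norm(w)
        px, py = np.sort(x @ w), np.sort(y @ w)
        b = np.concatenate([px, py])            # sup over b is attained at data points
        fx = 1.0 - np.searchsorted(px, b, side="left") / len(px)
        fy = 1.0 - np.searchsorted(py, b, side="left") / len(py)
        best = max(best, float(np.abs(fx - fy).max()))
    return best

def two_sample_test(x, y, c=3.0):
    """Reject H0: p = q when the halfspace statistic exceeds t_n = c * n^(-1/3)."""
    n = min(len(x), len(y))
    return halfspace_stat(x, y) >= c * n ** (-1.0 / 3.0)
```

With equal samples the statistic stays near its $O(\sqrt{d/n})$ fluctuation level, while a clear mean shift drives it toward the population halfspace discrepancy.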
The special case of $m = \infty$ is known as goodness-of-fit testing, and for the class of smooth distributions it was famously solved by Ingster (1987), who showed that in dimension $d = 1$ the problem is solvable with probability $1 - o(1)$ if and only if
$$n = \omega\big( \epsilon^{-\frac{2\beta + d/2}{\beta}} \big), \qquad (33)$$
in which case a variant of the $\chi^2$-test works. The case of general $m, n$ and $d \geq 1$ was resolved in Arias-Castro et al. (2018), who showed that the problem is solvable if and only if (33) holds with $n$ replaced by $\min\{n, m\}$, using the very same $\chi^2$-test; see also Li and Yuan (2019). In the remainder of the section we focus on the $m = n$ case for simplicity.

In a recent paper (Paik et al., 2023), the following test statistic for two-sample testing was proposed:
$$T_{d,k}(p, q) = \max_{(w,b) \in S^{d-1} \times [0, \infty)} \Big| \mathbb{E}_{X \sim p}\, (\langle w, X \rangle - b)_+^k - \mathbb{E}_{Y \sim q}\, (\langle w, Y \rangle - b)_+^k \Big|,$$
where the arguments $p, q$ can be either discrete (e.g. empirical measures of observed samples) or continuous densities. Note that here we take $(a)_+^0 = 1\{a \geq 0\}$ by convention. Specifically, the test proposed is to reject the null hypothesis when
$$T_{d,k}(p_n, q_n) \geq t_n, \qquad (34)$$
where $p_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$, $q_n = \frac{1}{n}\sum_{i=1}^n \delta_{Y_i}$ are empirical measures and the threshold $t_n$ satisfies both $t_n = o(1)$ and $t_n = \omega(1/\sqrt{n})$. One of their main technical results (Paik et al., 2023, Theorem 6) asserts that the test (34) returns the correct hypothesis with probability $1 - o(1)$ asymptotically as $n \to \infty$ for any qualifying sequence $\{t_n\}_{n \geq 1}$ and fixed $p, q$. However, this result leaves open questions about the sample complexity of their test, and in particular, whether it is able to achieve known minimax rates. It turns out that our results imply that their test, at least in the $k = 0$ case, cannot attain the optimal two-sample testing sample complexity (33) over the smooth class $\mathcal{P}_S(\beta, d, C)$. To connect to our results, notice that $T_{d,0}(p, q) = d_H(p, q)$.

Proposition 24.
For all $d \geq 1$ and $\beta > 0$, there exist constants $c = c(d, \beta)$, $c' = c'(d, \beta)$ such that for all $\epsilon > 0$, there exist probability density functions $p, q$ supported on the $d$-dimensional unit ball such that:

1. $\|p\|_{\beta,2}, \|q\|_{\beta,2} < c$,
2. $\|p - q\|_1 \asymp \|p - q\|_2 \asymp \epsilon$, and
3. the expected test statistic satisfies $\mathbb{E}[T_{d,0}(p_n, q_n)] \leq c'/\sqrt{n}$ for any $n \leq (\log\frac{1}{\epsilon})^{-d}\, \epsilon^{-\frac{2\beta+d+1}{\beta}}$.

In other words, consistent testing using the statistic $T_{d,0}$ is impossible with $n = o(\epsilon^{-\frac{2\beta+d+1}{\beta}})$ samples, which is a far cry from the optimal sample complexity (33) attainable by the $\chi^2$ test. The proof of Proposition 24 is given in Section F.

6. Conclusion

We analyzed the simple discriminating class of affine classifiers and proved its effectiveness in the ERM-GAN setting (2) within the Sobolev class $\mathcal{P}_S(\beta, d)$ and Gaussian mixtures $\mathcal{P}_G(d)$ with respect to the $L^2$ norm (see Theorem 18 and Corollary 19) and the total variation distance (see Theorem 1). Our findings affirm the near-optimality of these rates for the considered classes $\mathcal{P}_S$ and $\mathcal{P}_G$. Moreover, we present inequalities that interlink the $E_\gamma$, TV, and $L^2$ distances, and demonstrate (in some cases) the tightness of these relationships via corresponding lower bound constructions (Section E). We also interpret the generalized energy distance in several ways that help advocate for its use in real applications.

This work connects to a broader literature on the theoretical analysis of GAN-style models. An interesting question emerges about the interaction between the expressiveness and the concentration of the discriminator class. We found that the class of affine classifiers $\mathcal{D}_1$ is guaranteed to maintain some (potentially small) proportion of the total variation distance, and that it decays at the parametric rate between population and empirical distributions. Thus, we have traded off expressiveness for better concentration of the resulting IPM.
As discussed in Section 1.2, the Yatracos estimator lies at the other end of this discriminator expressiveness-concentration trade-off: the distance $d_{\mathcal Y}$ is as expressive as total variation when restricted to the generator class $\mathcal G$, but $\sup_{\nu\in\mathcal G}\mathbb{E}\,d_{\mathcal Y}(\nu, \nu_n)$ decays strictly slower than $1/\sqrt n$ for nonparametric classes $\mathcal G$. A downside compared to $d_H$ is that (i) the Yatracos class $\mathcal Y$ requires knowledge of $\mathcal G$, while our $\mathcal D_1$ is oblivious to $\mathcal G$, and (ii) the distance $d_{\mathcal Y}$ is impractical to compute, as it requires a covering of $\mathcal G$. Our question is: is it possible to find a class of sets $\mathcal S \subseteq 2^{B(0,1)}$ that lies at an intermediate point on this trade-off? In other words, does an $\mathcal S$ exist such that the ERM $\hat\nu$ defined in (2) using the discriminator class $\mathcal D = \mathcal S$ is optimal over, say, $\mathcal G = \mathcal P_S$, while the induced distance between empirical and population measures converges at a rate slower than $1/\sqrt n$ but faster than $n^{-\beta/(2\beta+d)}$? What would be the desiderata for a sample-efficient discriminator that neither is fully expressive against total variation nor concentrates at a parametric rate?

Acknowledgments

Supported in part by the NSF grant CCF-2131115 and the MIT-IBM Watson AI Lab.

Appendix A. Related works showing neural discriminator implies distance is small

In Section 1.2 we already reviewed, at a high level, existing results on GANs and density estimation. Here we provide full details and emphasize the differences with our work.

A.1 Results on smooth pushforward densities

A range of works has focused on showing results not for smooth densities but for smooth pushforwards $g_\# U$ (for example, $U \sim \mathrm{Unif}([0,1]^d)$ and $g$ a $\beta$-Hölder smooth map). While the exact setting differs, it can be shown that, at least for TV estimation on the open unit cube (or unit ball), guarantees for estimating smooth pushforwards imply guarantees for smooth density estimation. Indeed, one can first assume without loss of generality that the target density is bounded below, via the linear transform $f \mapsto \frac12(\mathrm{unif} + f)$ (see e.g.
Lemma 32.18 in Polyanskiy and Wu (2023+)), and bounded above by smoothness. Furthermore, for any two (Hölder) $\beta$-smooth densities bounded above and below on (open) uniformly convex domains, the unique optimal transport map with respect to the quadratic cost is $(\beta+1)$-smooth (see Lemma 10 in Chae et al. (2023) as well as Theorem 4.14 in Villani (2003)). Therefore, these rates can be compared to our results.

1. (Concurrent with ours) Stéphanovitch et al. (2023), Models 1 and 3. This model assumes that the true density is $g_\# U$ for $U$ uniform on the $d$-dimensional torus and $g$ a $(\beta+1)$-Hölder smooth map from the torus to $\mathbb{R}^p$. The authors obtain optimal rates with respect to a discriminator class consisting of $\gamma$-smooth functions for $\gamma \ge 1$ (the case $\gamma = 1$ matches the Wasserstein distance $W_1$). Under the assumption that the target density is $\beta$-smooth, results comparing $W_1$ and TV (Chae (2024) and Corollary 4.4 in Stéphanovitch et al. (2023)) apply. Substituting the rate $\tilde O\big(n^{-(\beta+1)/(2\beta+d)}\big)$ for the $W_1$ distance ($\gamma = 1$) from Theorem 5.8 (Model 3 therein) into Chae and Walker (2020), one gets the optimal rate $\tilde O\big(n^{-\beta/(2\beta+d)}\big)$. However, their discriminator crucially relies on knowledge of the parameters $n, \beta$ (see (4.5) therein).

2. Chae (2022). The authors obtained results for the $W_1$ distance of smooth pushforwards where the observed samples are corrupted by Gaussian noise with variance $\sigma^2$. In the zero-noise case, their convergence results in $W_1$ for $\beta$-Hölder smooth pushforward functions give the slightly suboptimal rate $O\big(n^{-\beta/(2\beta+d)}\big)$ (Theorem 3.2 therein), compared to the (conjectured) optimal rate $O\big(n^{-\beta/(2\beta+d-2)}\big)$ (Theorem 4.1 therein). As above, their results on $W_1$ can be applied to total variation in this setting by known distance comparison inequalities. Furthermore, their discriminator is a large nonparametric class (not a neural network).

3. Belomestny et al.
(2021). The authors studied the GAN estimator with $U \sim \mathrm{Unif}([0,1]^d)$ and a pushforward map $g$ that is $\Lambda$-regular and $(\beta+1)$-Hölder smooth (despite having the same minimax rate, this class does not cover all bounded $\beta$-smooth densities; see the discussion after Theorem 3 therein). Their results (Theorem 2) obtain JS-divergence rates that are optimal up to logarithmic factors, which are equivalent to TV rates since $\mathrm{TV}^2 \lesssim \mathrm{JS}$. In contrast to our work, their discriminator class is essentially a nonparametric class (of functions $\frac{p(x)}{q(x)+p(x)}$, where $p, q$ range over pairs of smooth densities) that is approximated by giant neural networks, for which they assume the existence of an oracle approximating smooth discriminator functions over those networks (Belomestny et al. (2023)). Furthermore, this discriminator network needs to be larger than some function of $\beta$ and $n$, whereas ours is fixed.

A.2 Results on smooth densities

There is also a large literature on GAN estimation of smooth densities (with slightly varying definitions) that is useful to review for contrast with our own work.

1. Singh et al. (2018) and Uppal et al. (2019). While these works focus on projection density estimators, Theorem 7 in Singh et al. (2018) and Theorem 9 in Uppal et al. (2019) establish smooth density estimation rates for GANs. The authors show that if the discriminator class (functions of Sobolev smoothness and Besov smoothness $s$, respectively) is replaced with a neural network, then the GAN estimator attains the optimal rate provided the neural networks approximate the smoothness class well, for which known results such as Yarotsky (2017) apply. However, bounds on TV or $W_1$ are not directly covered by their theorems.

2. Chen et al. (2020). The authors considered Hölder-smooth densities on convex support that are bounded below.
They obtained suboptimal rates $O\big(n^{-\gamma/(2\gamma+d)} \vee n^{-(\beta+1)/(2(\beta+1)+d)}\big)$ on the IPM indexed by Hölder smooth functions $\mathcal H^\gamma$, $\gamma \ge 1$ (where $\gamma = 1$ corresponds to Lipschitz functions, with $W_1$ being the respective IPM), versus the minimax rate $O\big(n^{-(\beta+\gamma)/(2\beta+d)}\big)$. They also had to rely on (oracle) neural-network universal-approximating discriminators that depend on $\beta$.

To conclude this section, we mention that there are also several recent works on non-GAN estimators that we omitted, such as (Stéphanovitch et al., 2023, Model 2), (Liang, 2021, Theorem 4), Divol (2022) and Tang and Yang (2023). We recommend the survey in Section 2.3 and Table 1 of Stéphanovitch et al. (2023).

Appendix B. Auxiliary Technical Results

In this section we list some technical lemmas that are used in our later proofs.

Theorem 25 (Plancherel theorem). Let $f, g \in L^2(\mathbb{R}^d)$. Then
$$\int_{\mathbb{R}^d} f(x)\overline{g(x)}\,dx = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d}\hat f(\omega)\overline{\hat g(\omega)}\,d\omega.$$

Lemma 26. Suppose $t, x, y > 0$. Then there exist finite $t$-dependent constants $C_1, C_2$ such that
$$x \le y\log(3+1/y)^t \;\Longrightarrow\; \frac{x}{\log(3+1/x)^t} \le C_1\,y \;\Longrightarrow\; x \le C_2\,y\log(3+1/y)^t.$$

Proof. Let us focus on the first implication. If $x \le y$, then it clearly holds. If $y \le x \le y\log(3+1/y)^t$, then, since $u \mapsto u/\log(3+1/u)^t$ is increasing, it suffices to show
$$\frac{x}{\log(3+1/x)^t} \le \frac{y\log(3+1/y)^t}{\log\big(3 + 1/(y\log(3+1/y)^t)\big)^t} \le C_1\,y.$$
The second inequality is equivalent to
$$\left(\frac{\log(3+1/y)}{\log\big(3 + 1/(y\log(3+1/y)^t)\big)}\right)^t \le C_1.$$
Now, if $y \ge 1/2$ then clearly taking $C_1 = \log_3(5)^t$ works. Suppose instead that $y \in (0, 1/2)$. Then, since $\log$ grows slower than any polynomial, there exists a $t$-dependent constant $c_t < \infty$ such that $\log(3+1/y) \le c_t\,y^{-1/(2t)}$ for all $y \in (0, 1/2)$. Therefore, we have
$$3 + \frac{1}{y\log(3+1/y)^t} \ge 3 + \frac{1}{c_t^t\,y^{1/2}}.$$
It is then clear that $\log(3+1/y) \le C_1^{1/t}\log\big(3 + \frac{1}{c_t^t\,y^{1/2}}\big)$ holds for all $y \in (0, 1/2)$ if we take $C_1$ large enough in terms of $t$. The second implication follows analogously and we omit its proof.

Lemma 27. Let $\mu$ be a probability distribution on $\mathbb{R}^d$, let $X \sim \mu$, and let $\gamma \in (0, 2)$. Then
$$\int_{\mathbb{R}^d}\frac{\mathbb{E}\big[(\cos\langle\omega, X\rangle - 1)^2 + \sin^2\langle\omega, X\rangle\big]}{\|\omega\|^{d+\gamma}}\,d\omega \le \frac{16\pi^{d/2}\,M_\gamma(\mu)}{\Gamma(d/2)\,\gamma(2-\gamma)}.$$

Proof. We use the inequality $(\cos t - 1)^2 + \sin^2(t) \le 4\min(t^2, 1)$, valid for all $t \in \mathbb{R}$.
Plugging in and using the Cauchy-Schwarz inequality $\langle\omega, X\rangle^2 \le \|\omega\|^2\|X\|^2$, the quantity on the left-hand side can be bounded as
$$4\int_{\mathbb{R}^d}\frac{\mathbb{E}\,\min(1, \|\omega\|^2\|X\|^2)}{\|\omega\|^{d+\gamma}}\,d\omega = 4\,\mathrm{vol}_{d-1}(S^{d-1})\,\mathbb{E}\left[\int_0^{1/\|X\|}\|X\|^2\,r^{1-\gamma}\,dr + \int_{1/\|X\|}^{\infty}\frac{dr}{r^{1+\gamma}}\right] = \frac{16\pi^{d/2}\,M_\gamma(\mu)}{\Gamma(d/2)\,\gamma(2-\gamma)},$$
where $\mathrm{vol}_{d-1}(S^{d-1}) = \frac{2\pi^{d/2}}{\Gamma(d/2)}$ is the surface area of the unit $(d-1)$-sphere.

Lemma 28. For $\gamma \in (0, 2)$ define
$$B_\gamma := \sup_{0 < a < b < \infty}\begin{cases}\big|\int_a^b \frac{\sin\omega}{\omega}\,d\omega\big| & \text{if } \gamma = 1,\\[2pt] \big|\int_a^b \frac{\cos\omega}{\omega^{(1+\gamma)/2}}\,d\omega\big| & \text{if } \gamma \in (0,1),\\[2pt] \big|\int_a^b \frac{\cos\omega - 1}{\omega^{(1+\gamma)/2}}\,d\omega\big| & \text{if } \gamma \in (1,2).\end{cases}$$
Then $B_\gamma < \infty$.

Proof. In the $\gamma > 1$ case, one immediately has $B_\gamma \le \int_0^\infty \frac{\min\{1, \omega^2/2\}}{\omega^{(1+\gamma)/2}}\,d\omega < \infty$. Now let us consider $\gamma \le 1$. One has $B_\gamma \le \sup_{0 < c < \infty} 2|I_\gamma(c)|$, where
$$I_\gamma(a) = \begin{cases}\int_1^a \frac{\sin\omega}{\omega}\,d\omega & \text{if } \gamma = 1,\\[2pt] \int_1^a \frac{\cos\omega}{\omega^{(1+\gamma)/2}}\,d\omega & \text{if } \gamma\in(0,1).\end{cases}$$
Clearly $a \mapsto I_\gamma(a)$ is continuous on $(0, \infty)$ for each $\gamma \in (0, 2)$. Moreover, $|I_\gamma(a)| \le |a - 1|\max\{a^{-(\gamma+1)/2}, 1\}$. Therefore, we only need to show that $\lim_{a\to\infty}|I_\gamma(a)|$ and $\lim_{a\to0}|I_\gamma(a)|$ are both finite. Let $g_\gamma(x) = x^{-(\gamma+1)/2}$ and $f_\gamma(x) = \sin(x)$ if $\gamma = 1$ and $\cos(x)$ otherwise, so that we may write $I_\gamma(a) = \int_1^a f_\gamma(\omega)g_\gamma(\omega)\,d\omega$.
1. For $a \to \infty$: since $g_\gamma(\omega) \to 0$ monotonically and $\big|\int_1^a f_\gamma(x)\,dx\big| \le 2$ is uniformly bounded, $\int_1^\infty f_\gamma(\omega)g_\gamma(\omega)\,d\omega$ converges to a finite value by Dirichlet's test for improper integrals (Malik and Arora, 1992, page 391).6
2. For $a \to 0$, the conclusion follows from the inequality $|\sin(\omega)| \le \min\{|\omega|, 1\}$ in the case $\gamma = 1$, and from $|I_\gamma(a)| \le \int_a^1 \omega^{-(1+\gamma)/2}\,d\omega \le \int_0^1 \omega^{-(1+\gamma)/2}\,d\omega = \frac{2}{1-\gamma}$ for $\gamma < 1$.
Therefore $\sup_{a>0}|I_\gamma(a)| < \infty$, which concludes the proof.

6. Reference pointed out by user Siminore on math.stackexchange.com.

Lemma 29. Let $\int_0^{\rightarrow\infty}(\cdot)\,d\omega := \lim_{\epsilon\to0}\int_\epsilon^{1/\epsilon}(\cdot)\,d\omega$ and recall the definition of $\psi_\gamma$ from (17). Then, for $x \neq 0$ the following holds:
$$\psi_\gamma(x) = C_{\psi_\gamma}\begin{cases}\int_0^{\rightarrow\infty}\frac{\sin(\omega x)}{\omega}\,d\omega & \text{for } \gamma = 1,\\[2pt] \int_0^{\rightarrow\infty}\frac{\cos(\omega x)}{\omega^{(1+\gamma)/2}}\,d\omega & \text{for } \gamma\in(0,1),\\[2pt] \int_0^{\rightarrow\infty}\frac{\cos(\omega x) - 1}{\omega^{(1+\gamma)/2}}\,d\omega & \text{for } \gamma\in(1,2),\end{cases}\qquad C_{\psi_\gamma} = \begin{cases}\Big(\cos\big(\frac{\pi(\gamma-1)}{4}\big)\,\Gamma\big(\frac{1-\gamma}{2}\big)\Big)^{-1} & \text{if } \gamma \neq 1,\\[2pt] \frac{1}{\pi} & \text{if } \gamma = 1.\end{cases}$$

Proof. For $x \neq 0$ clearly
$$\int_0^{\rightarrow\infty}\frac{\sin(\omega x)}{\omega}\,d\omega = \mathrm{sign}(x)\int_0^{\rightarrow\infty}\frac{\sin\omega}{\omega}\,d\omega = \mathrm{sign}(x)\,\frac{\pi}{2},$$
which shows the first claim. Assume from here on, without loss of generality, that $x > 0$. For $\gamma \in (0, 1)$, by the residue theorem (rotating the contour to the positive imaginary axis),
$$\int_0^{\rightarrow\infty}\frac{\cos(\omega x)}{\omega^{(1+\gamma)/2}}\,d\omega = x^{(\gamma-1)/2}\int_0^{\rightarrow\infty}\frac{\cos\omega}{\omega^{(1+\gamma)/2}}\,d\omega = x^{(\gamma-1)/2}\,\Re\int_0^{\rightarrow\infty}\frac{e^{i\omega}}{\omega^{(1+\gamma)/2}}\,d\omega = x^{(\gamma-1)/2}\cos\Big(\frac{\pi(\gamma-1)}{4}\Big)\,\Gamma\Big(\frac{1-\gamma}{2}\Big).$$
Similarly, for $\gamma \in (1, 2)$, integration by parts and the residue theorem give
$$\int_0^{\rightarrow\infty}\frac{\cos(\omega x) - 1}{\omega^{(1+\gamma)/2}}\,d\omega = x^{(\gamma-1)/2}\int_0^{\rightarrow\infty}(\cos\omega - 1)\,d\Big(\frac{-2}{(\gamma-1)\,\omega^{(\gamma-1)/2}}\Big) = -x^{(\gamma-1)/2}\,\frac{2}{\gamma-1}\int_0^{\rightarrow\infty}\frac{\sin\omega}{\omega^{(\gamma-1)/2}}\,d\omega$$
$$= -x^{(\gamma-1)/2}\,\frac{2}{\gamma-1}\,\cos\Big(\frac{\pi(\gamma-1)}{4}\Big)\,\Gamma\Big(\frac{3-\gamma}{2}\Big) = x^{(\gamma-1)/2}\cos\Big(\frac{\pi(\gamma-1)}{4}\Big)\,\Gamma\Big(\frac{1-\gamma}{2}\Big).$$

Lemma 30. Let $\varphi$ be the probability density function of $\mathcal N(0, \sigma^2 I_d)$ and write $\hat\varphi$ for its Fourier transform. Then, for any $\beta \ge 1$,
$$\big\|\hat\varphi(\omega)\,\|\omega\|^{\beta}\big\|_2^2 = \frac{\pi^{d/2}}{\Gamma(d/2)\,\sigma^{2\beta+d}}\,\Gamma\Big(\frac{2\beta+d}{2}\Big) \le \frac{5\,\pi^{d/2}}{\Gamma(d/2)\,\sigma^{2\beta+d}}\Big(\frac{2\beta+d}{2e}\Big)^{\frac{2\beta+d-1}{2}}.$$

Proof. It is well known that $\hat\varphi(\omega) = e^{-\frac{\sigma^2}{2}\|\omega\|^2}$. The claimed equality then follows from the formula for the $2\beta$-th moment of the Gaussian distribution with mean $0$ and variance $1/(2\sigma^2)$. The inequality follows by Lemma 31.

Lemma 31 (Properties of the gamma function). For all $x > 1$ the inequality $\Gamma(x) \le 5\,(x/e)^{x-1/2}$ holds.

Proof. In Minc and Sathre (1964) the authors showed that $\log\Gamma(x) \le (x - \frac12)\log(x) - x + \frac12\log(2\pi) + 1$ for all $x \ge 1$, from which the claim follows, as $\exp\big(\frac12\log(2\pi) + \frac12\big) < 5$.

Lemma 32. Let $b \ge 1$ and $|a| < b$. Then
$$\int_0^r x^a|\sin(x)|^b\,dx = O(r^{a+b}) \quad\text{as } r\to\infty,$$
where we hide constants depending on $a, b$.

Proof. Since we are only interested in the asymptotic behaviour as $r \to \infty$, assume without loss of generality that $r \ge 1$. Then, we have
$$\int_0^r x^a|\sin(x)|^b\,dx = \underbrace{\int_0^1 x^a|\sin(x)|^b\,dx}_{I} + \underbrace{\int_1^r x^a|\sin(x)|^b\,dx}_{II}.$$
Using the inequality $|\sin(x)| \le x$, we can bound the first term as
$$I \le \int_0^1 x^{a+b}\,dx = \frac{1}{a+b+1} \le 1 = O(r^{a+b}),$$
since $a + b > 0$. For the second term, we obtain
$$II \le \int_1^r x^a\,dx \lesssim \begin{cases} r^{a+1} & \text{if } a \neq -1,\\ \log(r) & \text{if } a = -1,\end{cases}$$
which is $O(r^{a+b})$, where the last step uses $a + b > 0$ and $b \ge 1$.

Lemma 33. Let $a, b, c \in \mathbb{R}$ with $b > 0$ be constants. For all large enough $r$ one has
$$\int_r^\infty x^a\exp\Big(-\frac{bx}{\log^2(x+2)}\Big)\,dx \le r^{-c}.$$

Proof. Assume, without loss of generality, that $c \ge 0$. For all large enough $x$ one has
$$x^a\exp\Big(-\frac{bx}{\log^2(x+2)}\Big) \le x^{-c-2}.$$
Therefore, for large enough $r$,
$$\int_r^\infty x^a\exp\Big(-\frac{bx}{\log^2(x+2)}\Big)\,dx \le \int_r^\infty x^{-c-2}\,dx \lesssim r^{-c-1} < r^{-c}.$$

Lemma 34. Let $J_\nu$ be the Bessel function of the first kind of order $\nu$.
1. For all $x \in \mathbb{R}^d$,
$$\int_{S^{d-1}} e^{i\langle x, v\rangle}\,d\sigma(v) = (2\pi)^{d/2}\,\|x\|^{1-d/2}\,J_{d/2-1}(\|x\|).$$
2. For any $\nu \in \mathbb{R}$, as $x \to \infty$,
$$J_\nu(x) = \sqrt{\frac{2}{\pi x}}\,\cos\Big(x - \frac{\nu\pi}{2} - \frac{\pi}{4}\Big) + O(x^{-3/2}). \tag{35}$$
3. For all $x \in \mathbb{R}^d$,
$$\int_{B_d(0,1)} e^{i\langle x, w\rangle}\,dw = \frac{(2\pi)^{d/2}}{\|x\|^{d/2}}\,J_{d/2}(\|x\|).$$

Proof. For Item 1 set $r = \|x\|$ and $s = (2\pi)^{-1}$ in the calculation on page 154 of Stein and Weiss (1971). For Item 2 see (Watson, 1980, Eq. (1) in Section 7.21). For Item 3, assuming without loss of generality that $x = \|x\|e_1$, we can compute
$$\int_{\|w\|\le1} e^{i\langle x, w\rangle}\,dw = \int_{-1}^{1} e^{i\|x\|w_1}\int_{w_2^2+\dots+w_d^2 \le 1 - w_1^2} dw_2\dots dw_d\,dw_1 = v_{d-1}\int_{-1}^{1} e^{i\|x\|w_1}\,(1 - w_1^2)^{(d-1)/2}\,dw_1, \tag{36}$$
where $v_{d-1} = \pi^{(d-1)/2}/\Gamma\big(\frac{d+1}{2}\big)$ is the volume of the unit ball in $\mathbb{R}^{d-1}$. Recall from (Watson, 1980, Section 3.1) the definition of the Bessel function of the first kind as
$$J_\nu(x) = \frac{(x/2)^\nu}{\Gamma(\nu+\frac12)\,\Gamma(\frac12)}\int_0^\pi \cos(x\cos(\theta))\,\sin^{2\nu}(\theta)\,d\theta,$$
valid for $\nu > -1/2$; the above is also known as the Poisson representation. Changing variables to $u = \cos(\theta)$, we see that it is equal to
$$J_\nu(x) = \frac{(x/2)^\nu}{\Gamma(\nu+\frac12)\,\Gamma(\frac12)}\int_{-1}^{1} e^{ixu}(1 - u^2)^{\nu - 1/2}\,du. \tag{37}$$
Comparing (37) (with $\nu = d/2$) with (36) concludes the proof.

Lemma 35. There exists a radial function $h_0 \in L^2(\mathbb{R}^d)$ such that $\mathrm{supp}(h_0) \subseteq B(0, 1)$,
$$|\hat h_0(w)| \le C\exp\Big(-\frac{c\,\|w\|}{\log(\|w\|+2)^2}\Big)\ \text{ for all } w\in\mathbb{R}^d, \qquad |\hat h_0(w)| \ge \frac12\ \text{ for all } \|w\| \le r_{\min},$$
where $C, c, r_{\min} > 0$.

Proof. Apply Theorem 1.4 in Cohen (2023) using the spherically symmetric weight function $u : \mathbb{R}^d \to \mathbb{R}_{\ge0}$ defined by $u(w) = u(\|w\|) = \|w\|\log(\|w\|+2)^{-2}$, where $(a)_+ := \max(a, 0)$ for $a \in \mathbb{R}$.

Appendix C. Proof of Theorem 8

For $v \in S^{d-1}$ and $b \in \mathbb{R}$, let $\theta_v(x) = \langle v, x\rangle$ and write $\eta_v := \theta_{v\#}(\mu - \nu)$ for the pushforward of the signed measure $\mu - \nu$ through the map $\theta_v$. To start with, we notice that
$$\int_{S^{d-1}}\int_{\mathbb{R}}\Big[\mathbb{E}\,\psi_\gamma(\langle X, v\rangle - b) - \mathbb{E}\,\psi_\gamma(\langle Y, v\rangle - b)\Big]^2\,db\,d\sigma(v) = \int_{S^{d-1}}\int_{\mathbb{R}}\Big(\int_{\mathbb{R}}\psi_\gamma(x - b)\,d\eta_v(x)\Big)^2\,db\,d\sigma(v). \tag{38}$$
For each $v \in S^{d-1}$, the measure $\eta_v$ has at most countably many atoms, therefore $b \mapsto \eta_v(\{b\})$ vanishes Leb-almost everywhere. Then, by Tonelli's theorem, we can conclude that $\eta_v(\{b\}) = 0$ for $\sigma\otimes$Leb-almost every $(v, b)$; thus going forward we can focus on the case $x \neq b$. By Lemma 29, and writing $A_\epsilon = [\epsilon, 1/\epsilon]$ for $\epsilon > 0$, we have
$$\int_{\mathbb{R}}\psi_\gamma(x - b)\,d\eta_v(x) = C_{\psi_\gamma}\int_{\mathbb{R}}\lim_{\epsilon\to0}\int_{A_\epsilon}\begin{cases}\frac{\sin(\omega(x-b))}{\omega} & \text{if } \gamma = 1\\[2pt] \frac{\cos(\omega(x-b))}{\omega^{(1+\gamma)/2}} & \text{if } \gamma\in(0,1)\\[2pt] \frac{\cos(\omega(x-b)) - 1}{\omega^{(1+\gamma)/2}} & \text{if } \gamma\in(1,2)\end{cases}\,d\omega\,d\eta_v(x).$$
Note that in the $\gamma = 1$ case we implicitly used that $\int_{\mathbb{R}} d\eta_v(x) = 0$.
To exchange the integral over $x$ and the limit over $\epsilon$, notice that for any $\epsilon > 0$ and $x \neq b \in \mathbb{R}$,
$$\Big|\int_{A_\epsilon}\frac{\sin(\omega(x-b))}{\omega}\,d\omega\Big| \le B_\gamma \ \text{ if } \gamma = 1, \qquad \Big|\int_{A_\epsilon}\frac{\cos(\omega(x-b))}{\omega^{(1+\gamma)/2}}\,d\omega\Big| \le B_\gamma\,|x-b|^{(\gamma-1)/2}\ \text{ if } \gamma\in(0,1),$$
$$\Big|\int_{A_\epsilon}\frac{\cos(\omega(x-b)) - 1}{\omega^{(1+\gamma)/2}}\,d\omega\Big| \le B_\gamma\,|x-b|^{(\gamma-1)/2}\ \text{ if } \gamma\in(1,2),$$
where $B_\gamma < \infty$ depends only on $\gamma$ and is defined in Lemma 28. We now show that $\int_{\mathbb{R}}|x - b|^{(\gamma-1)/2}\,d|\eta_v|(x) < \infty$ for $\sigma\otimes$Leb-almost every $(b, v)$. To this end, let $S = \big\{(b, v)\in\mathbb{R}\times S^{d-1} : \int_{\mathbb{R}}|x-b|^{(\gamma-1)/2}\,d|\eta_v|(x) = \infty\big\}$ and assume for contradiction that $(\sigma\otimes\mathrm{Leb})(S) > 0$. Then $1_{([-B,B]\times S^{d-1})\cap S} \uparrow 1_S$ as $B\to\infty$, and thus by the monotone convergence theorem there exists a finite $B$ such that $(\mathrm{Leb}\otimes\sigma)\big(([-B,B]\times S^{d-1})\cap S\big) > 0$. However, by Tonelli's theorem and the Cauchy-Schwarz inequality we have
$$\int_{-B}^{B}\Big(\int_{\mathbb{R}}|x-b|^{(\gamma-1)/2}\,d|\eta_v|(x)\Big)^2\,db \lesssim \int_{-B}^{B}\int_{\mathbb{R}}|x-b|^{\gamma-1}\,d|\eta_v|(x)\,db \lesssim \int_{\mathbb{R}}\int_0^{B+|x|}b^{\gamma-1}\,db\,d|\eta_v|(x) \lesssim \int_{\mathbb{R}}(B + |x|)^{\gamma}\,d|\eta_v|(x) \lesssim B^\gamma + \mathbb{E}_{X\sim\mu}\big[|\langle v, X\rangle|^\gamma\big] + \mathbb{E}_{Y\sim\nu}\big[|\langle v, Y\rangle|^\gamma\big] \le B^\gamma + M_\gamma(\mu + \nu),$$
which, after integration over $v \in S^{d-1}$, leads to a contradiction if $M_\gamma(\mu+\nu) < \infty$. Continuing under the assumption $M_\gamma(\mu+\nu) < \infty$, we can apply the dominated convergence theorem to obtain
$$\int_{\mathbb{R}}\psi_\gamma(x-b)\,d\eta_v(x) = C_{\psi_\gamma}\lim_{\epsilon\to0}\int_{\mathbb{R}}\int_{A_\epsilon}\begin{cases}\frac{\sin(\omega(x-b))}{\omega} & \text{if } \gamma=1\\[2pt] \frac{\cos(\omega(x-b))}{\omega^{(1+\gamma)/2}} & \text{if } \gamma\in(0,1)\\[2pt] \frac{\cos(\omega(x-b))-1}{\omega^{(1+\gamma)/2}} & \text{if } \gamma\in(1,2)\end{cases}\,d\omega\,d\eta_v(x).$$
Then, by Fubini's theorem, we exchange the order of integration to get
$$\int_{\mathbb{R}}\psi_\gamma(x-b)\,d\eta_v(x) = C_{\psi_\gamma}\lim_{\epsilon\to0}\int_{A_\epsilon}\int_{\mathbb{R}}\begin{cases}\frac{\sin(\omega(x-b))}{\omega} & \text{if } \gamma=1\\[2pt] \frac{\cos(\omega(x-b))}{\omega^{(1+\gamma)/2}} & \text{if } \gamma\in(0,1)\\[2pt] \frac{\cos(\omega(x-b))-1}{\omega^{(1+\gamma)/2}} & \text{if } \gamma\in(1,2)\end{cases}\,d\eta_v(x)\,d\omega.$$
Noticing that $\int_{\mathbb{R}} e^{-i\omega x}\,d\eta_v(x) = \hat\eta_v(\omega)$, $\overline{\hat\eta_v(\omega)} = \hat\eta_v(-\omega)$ and $\hat\eta_v(0) = 0$, we obtain
$$\int_{\mathbb{R}}\psi_\gamma(x-b)\,d\eta_v(x) = C_{\psi_\gamma}\lim_{\epsilon\to0}\int_{A_\epsilon}\begin{cases}\Im\big(e^{-i\omega b}\,\hat\eta_v(\omega)\big)\,\omega^{-1} & \text{if } \gamma=1\\[2pt] \Re\big(e^{-i\omega b}\,\hat\eta_v(\omega)\big)\,\omega^{-(1+\gamma)/2} & \text{if } \gamma\neq1\end{cases}\,d\omega = C_{\psi_\gamma}\lim_{\epsilon\to0}\begin{cases}\Im\,\hat\Psi_{\gamma,v,\epsilon}(b) & \text{if } \gamma=1\\[2pt] \Re\,\hat\Psi_{\gamma,v,\epsilon}(b) & \text{if } \gamma\neq1,\end{cases}$$
where we write $\Psi_{\gamma,v,\epsilon}(\omega) = \frac{\hat\eta_v(\omega)}{\omega^{(1+\gamma)/2}}\,1\{\omega\in A_\epsilon\}$ and $\hat\Psi_{\gamma,v,\epsilon}(b) = \int_{\mathbb{R}}\Psi_{\gamma,v,\epsilon}(\omega)\,e^{-ib\omega}\,d\omega$. Notice that $\Psi_{\gamma,v,\epsilon}$ is bounded and compactly supported, and thus lies in $L^p(\mathbb{R})$ for any $p$; in particular $\Psi_{\gamma,v,\epsilon} \in L^1(\mathbb{R})\cap L^2(\mathbb{R})$, which ensures that $\hat\Psi_{\gamma,v,\epsilon} \in L^\infty(\mathbb{R})\cap L^2(\mathbb{R})$. Finally, let us write
$$\Psi_{\gamma,v}(\omega) = \lim_{\epsilon\to0}\Psi_{\gamma,v,\epsilon}(\omega) = \frac{\hat\eta_v(\omega)}{\omega^{(1+\gamma)/2}}\,1\{\omega > 0\}$$
for every $\omega$. We now show that $\Psi_{\gamma,v} \in L^2(\mathbb{R})$ provided $M_\gamma(\mu+\nu) < \infty$, which is assumed throughout. Let $(X, Y) \sim \mu\otimes\nu$.
We have
$$\int_{\mathbb{R}}|\Psi_{\gamma,v}(\omega)|^2\,d\omega = \int_0^\infty\frac{\big(\mathbb{E}[\cos\langle\omega v, X\rangle - \cos\langle\omega v, Y\rangle]\big)^2 + \big(\mathbb{E}[\sin\langle\omega v, X\rangle - \sin\langle\omega v, Y\rangle]\big)^2}{\omega^{1+\gamma}}\,d\omega.$$
Using the inequality $(a - b)^2 \le 2(a-1)^2 + 2(b-1)^2$ for the cosine term, the inequality $(a - b)^2 \le 2a^2 + 2b^2$ for the sine term, and applying Jensen's inequality to take the expectation outside, we can conclude that $\Psi_{\gamma,v} \in L^2(\mathbb{R})$ by Lemma 27. Thus, by the dominated convergence theorem, $\|\Psi_{\gamma,v,\epsilon} - \Psi_{\gamma,v}\|_2 \to 0$ as $\epsilon\to0$. Then, by Parseval's identity,
$$\|\hat\Psi_{\gamma,v,\epsilon} - \hat\Psi_{\gamma,v}\|_2 \to 0 \tag{39}$$
as $\epsilon\to0$. It is well known that convergence in $L^2(\mathbb{R})$ implies that there exists a subsequence $\{\epsilon_n\}_{n=1}^\infty$ with $\epsilon_n\to0$ and $\hat\Psi_{\gamma,v,\epsilon_n} \to \hat\Psi_{\gamma,v}$ almost everywhere.7 Therefore, by passing to this subsequence, it follows that
$$\int_{\mathbb{R}}\psi_\gamma(x-b)\,d\eta_v(x) = C_{\psi_\gamma}\begin{cases}\Im\,\hat\Psi_{\gamma,v}(b) & \text{if } \gamma = 1,\\ \Re\,\hat\Psi_{\gamma,v}(b) & \text{if } \gamma\neq1,\end{cases}$$
for $\sigma\otimes$Leb-almost every $(b, v) \in \mathbb{R}\times S^{d-1}$. Note that since $\eta_v$ is a real measure,
$$\Re\,\hat\Psi_{\gamma,v}(b) = \frac{\hat\Psi_{\gamma,v}(b) + \overline{\hat\Psi_{\gamma,v}(b)}}{2} = \frac12\int_0^\infty\Big(\frac{\hat\eta_v(\omega)}{\omega^{(1+\gamma)/2}}\,e^{-ib\omega} + \frac{\hat\eta_v(-\omega)}{\omega^{(1+\gamma)/2}}\,e^{ib\omega}\Big)\,d\omega = \int_{\mathbb{R}}\frac{\hat\eta_v(\omega)}{2\,|\omega|^{(1+\gamma)/2}}\,e^{-ib\omega}\,d\omega = \mathcal F\Big[\frac{\hat\eta_v(\omega)}{2\,|\omega|^{(1+\gamma)/2}}\Big](b), \tag{40}$$
and similarly
$$\Im\,\hat\Psi_{\gamma,v}(b) = \frac{\hat\Psi_{\gamma,v}(b) - \overline{\hat\Psi_{\gamma,v}(b)}}{2i} = \frac{1}{2i}\int_0^\infty\Big(\frac{\hat\eta_v(\omega)}{\omega^{(1+\gamma)/2}}\,e^{-ib\omega} - \frac{\hat\eta_v(-\omega)}{\omega^{(1+\gamma)/2}}\,e^{ib\omega}\Big)\,d\omega = \mathcal F\Big[\frac{\hat\eta_v(\omega)\,\mathrm{sign}(\omega)}{2i\,|\omega|^{(1+\gamma)/2}}\Big](b). \tag{41}$$

7. We could also conclude this by Carleson's theorem.

Plugging (40) and (41) into (38), by Parseval's identity (implicitly using that $\Psi_{\gamma,v} \in L^2(\mathbb{R})$), we obtain
$$\int_{S^{d-1}}\int_{\mathbb{R}}\Big[\mathbb{E}\,\psi_\gamma(\langle X, v\rangle - b) - \mathbb{E}\,\psi_\gamma(\langle Y, v\rangle - b)\Big]^2\,db\,d\sigma(v) = \frac{\pi C_{\psi_\gamma}^2}{2}\int_{S^{d-1}}\int_{\mathbb{R}}\frac{|\hat\eta_v(\omega)|^2}{|\omega|^{1+\gamma}}\,d\omega\,d\sigma(v) = \pi C_{\psi_\gamma}^2\int_{\mathbb{R}^d}\frac{|\mathcal F[\mu - \nu](\omega)|^2}{\|\omega\|^{d+\gamma}}\,d\omega,$$
where the last step uses $\hat\eta_v(\omega) = \mathcal F[\mu-\nu](\omega v)$ and a polar change of variables. The result follows after comparing with Proposition 5.

Appendix D. Proof of Proposition 9

Let $\mathcal S(\mathbb{R}^d)$ be the Schwartz space and $\mathcal S'(\mathbb{R}^d)$ be the space of all tempered distributions on $\mathbb{R}^d$. Let $\tau = \mu - \nu$ and $s = (d+\gamma)/2$. First, note that
$$\int\frac{K_s(x)}{(1 + \|x\|^2)^d}\,dx < \infty,$$
so by (Landkof, 1972, Theorem 0.10) we have $K_s \in \mathcal S'(\mathbb{R}^d)$. By (Landkof, 1972, Theorem 0.12), since $K_s \in \mathcal S'(\mathbb{R}^d)$ and $\tau$ has compact support,
$$\widehat{I_s\tau} = \widehat{K_s * \tau} = \hat K_s\,\hat\tau.$$
By Plancherel's identity,
$$(2\pi)^{d/2}\,\|I_s\tau\|_2 = \big\|\widehat{I_s\tau}\big\|_2 = \|\hat K_s\,\hat\tau\|_2 = \frac{1}{\sqrt{F_\gamma(d)}}\,\mathcal E_\gamma(\mu, \nu),$$
where the last equality follows from Proposition 5.
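The generalized energy distance that appears throughout also admits a simple plug-in estimator from two samples. A minimal sketch (the function name is ours; we compute the Székely-Rizzo form $2\,\mathbb{E}\|X-Y\|^\gamma - \mathbb{E}\|X-X'\|^\gamma - \mathbb{E}\|Y-Y'\|^\gamma$, which matches $\mathcal E_\gamma^2$ only up to the paper's normalizing constant, an assumption of this sketch):

```python
import math
import random

def energy_distance(xs, ys, gamma=1.0):
    """V-statistic estimate of the (squared) generalized energy distance
    between two samples in R^d:
        2 E|X-Y|^gamma - E|X-X'|^gamma - E|Y-Y'|^gamma,
    nonnegative for gamma in (0, 2), zero iff the empirical measures agree.
    Normalization differs from the paper's E_gamma by a constant factor."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b))) ** gamma
    cross = sum(dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))
    within_x = sum(dist(a, b) for a in xs for b in xs) / len(xs) ** 2
    within_y = sum(dist(a, b) for a in ys for b in ys) / len(ys) ** 2
    return 2.0 * cross - within_x - within_y
```

Unlike the halfspace supremum, this quantity is an explicit double sum over pairwise distances, hence both cheap to evaluate and differentiable in the sample points, which is what makes it attractive for fitting generative models.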
Appendix E. Proof of Theorem 12 and Proposition 13

In this section we prove both Theorem 12 and Proposition 13. To do so, we give two constructions. The first one, presented in Section E.1, only applies in one dimension and gives optimal results. The second construction, given in Section E.2, applies in all dimensions but loses a polylogarithmic factor.

Notation: Abusing notation, in what follows we write $\mathcal E_\gamma(f, g)$ and $d_H(f, g)$ even when $f$ and $g$ are not necessarily probability measures or probability densities. We will also write $\|f\|_{t,2} = \big\|\,\|\omega\|^t\hat f\,\big\|_2$ for potentially negative exponents $t \in \mathbb{R}$. Note that $\mathcal E_\gamma(f, 0) = \sqrt{F_\gamma(d)}\,\|f\|_{-\frac{d+\gamma}{2},2}$.

E.1 The Case d = 1

The lemma below constructs a difference of two densities with favorable properties.

Lemma 36. Let $f(x) = 1\{|x| \le \pi\}\sin(rx)$ with $r \in \mathbb{Z}$, and write $f_\beta = f * \dots * f$ for $f$ convolved with itself $\beta - 1$ times, i.e. $f_1 = f$, $f_2 = f * f$, and so on. Fix an integer $\beta \ge 1$ and let $|t| < \beta$. We have
$$\|f_\beta\|_{t,2} \asymp r^t, \qquad \|f_\beta\|_1 \asymp 1, \qquad d_H(f_\beta, 0) \asymp \frac1r \qquad\text{as } r\to\infty, \tag{42}$$
where the constants may depend on $\beta, t$.

Proof. The intuition for the estimates (42) is simple: most of the energy of $f$ (and hence $f_\beta$) is at frequencies around $|\omega| \approx r$, and thus differentiating $t$ times boosts the $L^2$-energy by $r^t$. A simple computation shows
$$\hat f(\omega) = c\,(-1)^r\,\frac{i\,r}{\omega^2 - r^2}\,\sin(\omega\pi).$$
Note that $\hat f_\beta = \hat f^{\,\beta}$ and that, because $r \in \mathbb{Z}$, the apparent singularities of $\hat f$ at $\omega = \pm r$ are removable, so $\hat f$ is bounded.

Estimating $\|f_\beta\|_{t,2}$. By definition we have
$$\|f_\beta\|_{t,2}^2 = 2\int_0^\infty |\hat f(\omega)|^{2\beta}\,\omega^{2t}\,d\omega \asymp r^{2\beta}\int_0^\infty\frac{\omega^{2t}\,\sin^{2\beta}(\omega\pi)}{(\omega^2 - r^2)^{2\beta}}\,d\omega.$$
We decompose the integral into three regimes:
1. $\omega < r/2$: here $|\omega^2 - r^2| \gtrsim r^2$ and thus
$$r^{2\beta}\int_0^{r/2}(\dots)\,d\omega \lesssim r^{-2\beta}\int_0^{r/2}\omega^{2t}\sin^{2\beta}(\omega\pi)\,d\omega \lesssim r^{2t}$$
by Lemma 32.
2. $\omega > 3r/2$: here $|\omega^2 - r^2| \gtrsim \omega^2$ and thus
$$r^{2\beta}\int_{3r/2}^\infty(\dots)\,d\omega \lesssim r^{2\beta}\int_{3r/2}^\infty\frac{\sin^{2\beta}(\omega\pi)\,\omega^{2t}}{\omega^{4\beta}}\,d\omega \lesssim r^{2\beta}\,r^{2t-4\beta+1} = r^{1+2t-2\beta} \lesssim r^{2t}.$$
3. $\omega \in [r/2, 3r/2]$: here $|\omega^2 - r^2| \asymp |y|\,r$, where $y = \omega - r$. Note also $\sin(\omega\pi) = \sin(r\pi + y\pi) = (-1)^r\sin(y\pi)$, and $\omega \asymp r$. Thus
$$r^{2\beta}\int_{r/2}^{3r/2}(\dots)\,d\omega = r^{2\beta}\int_{-r/2}^{r/2}(\dots)\,dy \lesssim r^{2\beta}\int_{-r/2}^{r/2}\frac{\sin^{2\beta}(y\pi)\,r^{2t}}{(yr)^{2\beta}}\,dy \lesssim r^{2t}\int_{-\infty}^{\infty}\frac{\sin^{2\beta}(y\pi)}{y^{2\beta}}\,dy \asymp r^{2t},$$
where the last inequality follows because the integrand is bounded at $0$ and decays as $y^{-2\beta} \le y^{-2}$ at infinity.

Estimating $\|f_\beta\|_1$.
Follows from $\|f_\beta\|_1 \ge \|f_\beta\|_2^2/\|f_\beta\|_\infty \gtrsim 1$ by the Cauchy-Schwarz inequality together with $\|f_\beta\|_\infty \le c\,\|\hat f_\beta\|_1 \lesssim 1$ by the Hausdorff-Young inequality, while $\|f_\beta\|_1 \le \|f\|_1^\beta \lesssim 1$ by Young's inequality.

Estimating $d_H$. We get $d_H(f_\beta, 0) \gtrsim \mathcal E_1(f_\beta, 0) \asymp \|f_\beta\|_{-1,2} \asymp \frac1r$ from the first estimate. For the upper bound, note that $\widehat{\mathrm{sign}}(\omega) = \frac{2}{i\omega}$ and $d_H(f_\beta, 0) = \sup_b\frac12\big|\int f_\beta(x)\,\mathrm{sign}(x - b)\,dx\big|$, so by Plancherel's identity,
$$d_H(f_\beta, 0) \lesssim \sup_b\Big|\int\hat f_\beta(\omega)\,\frac{e^{ib\omega}}{\omega}\,d\omega\Big| \lesssim \int\frac{r^\beta\,|\sin(\omega\pi)|^\beta}{|\omega^2 - r^2|^\beta\,|\omega|}\,d\omega.$$
The fact that the above is $O(1/r)$ follows analogously to the proof of our bound on $\|f_\beta\|_{t,2}$, so we omit it. This concludes our proof.

Proof [Proof of Theorem 12 and Proposition 13 for $d = 1$]. We now turn to showing tightness in one dimension, utilizing the density difference constructed in Lemma 36. Given a value of the smoothness $\beta > 0$, set $\bar\beta = \lceil\beta\rceil + 1$ and let $f_{\bar\beta}$ be as in Lemma 36 with $r \asymp \epsilon^{-1/\beta}$ for some $\epsilon \in (0, 1)$. Let $p_0$ be a smooth, compactly supported density with $\inf_{x\in[-\pi,\pi]}p_0(x) > 0$. Define
$$p_\epsilon(x) = p_0(x) + \tfrac{\epsilon}{2}\,f_{\bar\beta}(\bar\beta x) \qquad\text{and}\qquad q_\epsilon(x) = p_0(x) - \tfrac{\epsilon}{2}\,f_{\bar\beta}(\bar\beta x).$$
Clearly both $p_\epsilon, q_\epsilon$ are compactly supported probability densities for sufficiently small $\epsilon$, since $\|f_{\bar\beta}\|_\infty < \infty$ and $f_{\bar\beta}$ is supported on $[-\bar\beta\pi, \bar\beta\pi]$. By Lemma 36, for each $\gamma \in (0, 2)$ the two densities satisfy
$$\|p_\epsilon - q_\epsilon\|_1 \asymp \epsilon, \qquad \|p_\epsilon\|_{\beta,2} \vee \|q_\epsilon\|_{\beta,2} \lesssim 1, \qquad \mathcal E_\gamma(p_\epsilon, q_\epsilon) \asymp \|p_\epsilon - q_\epsilon\|_{-\frac{1+\gamma}{2},2} \asymp \epsilon^{1+\frac{1+\gamma}{2\beta}}, \qquad d_H(p_\epsilon, q_\epsilon) \asymp \epsilon^{1+\frac1\beta}.$$
This proves both Theorem 12 and Proposition 13 for $d = 1$.

E.2 The Case d > 1

We move on to the case of general dimension. In Section E.2.1 we outline our approach; then, in Section E.2.2, we give full details of the construction, following the argument outlined there.

E.2.1 OVERVIEW

For the discussion below, we assume that the ambient dimension $d \ge 2$. Our construction here is less straightforward than for $d = 1$ in Section E.1, but shares the same basic idea. Recall that the basic premise is that we want to saturate Hölder's inequality in eq. (22), which requires the density difference $f = \mu - \nu$ to have Fourier transform (almost) supported on a sphere. For $d = 1$ we took $f$ to be a pure sinusoid.
However, such an $f$ is of course not compactly supported, which is why we multiplied the sinusoid by a rectangle (and then convolved it with itself several times to gain smoothness); the rectangle served as a mollifier. For $d > 1$, let us attempt to follow the same strategy and take
$$f_r(x) = g_r(x)\,h(x),$$
where $r > 0$ is a parameter, $h$ is some compactly supported smooth mollifier, and $g_r(x)$ is defined implicitly via
$$\hat g_r(\omega) = r^{(1-d)/2}\,\delta(\|\omega\| - r),$$
where here and below we denote, a bit informally, by $\delta(\|\cdot\| - r)$ a distribution that integrates any smooth compactly supported function $\varphi$ as follows:
$$\int_{\mathbb{R}^d}\varphi(\omega)\,\delta(\|\omega\| - r)\,d\omega := r^{d-1}\int_{S^{d-1}}\varphi(r\omega)\,d\sigma(\omega) = \frac{2\pi^{d/2}\,r^{d-1}}{\Gamma(\frac d2)}\,\mathbb{E}[\varphi(rX)],$$
where $\sigma$ is the unnormalized surface measure of $S^{d-1}$ and $X$ is a random vector uniformly distributed on $S^{d-1}$. Explicit computation shows
$$g_r(x) = \mathcal F^{-1}[\hat g_r](x) = \frac{r^{(d-1)/2}}{(2\pi)^d}\int_{S^{d-1}} e^{ir\langle\omega, x\rangle}\,d\sigma(\omega) = \frac{r^{(d-1)/2}}{(2\pi)^{d/2}}\,(r\|x\|)^{1-d/2}\,J_{d/2-1}(r\|x\|),$$
where $J_\nu$ denotes the Bessel function of the first kind of order $\nu$. Notice that $g_r$ is spherically symmetric and real-valued (some further properties of it are collected below in Lemma 34). Note that $|g_r(x)| = O(1)$ as $r\to\infty$ for any fixed $x \neq 0$ (Lemma 34), while at the origin we have $|g_r(0)| = \Omega(r^{(d-1)/2})$, which follows from the series expansion of the Bessel function given, for example, in (Watson, 1980, Sections 3.1-3.11). This causes an issue for $d > 1$: as $r\to\infty$, $g_r$ is too large at the origin compared to its tails, which makes it difficult to use as the difference between two probability densities. Hence, we choose our mollifier $h$ to be supported on an annulus instead of on a ball. In addition, it will also be convenient for it to have a super-polynomially decaying Fourier transform, i.e.
$$|\hat h(w)| \le H(\|w\|) := C\exp\Big(-\frac{c\,\|w\|}{\log(\|w\|+2)^2}\Big).$$
The existence of the desired function $h$ is proven in Lemma 37. Note that all of the Fourier energy of $g_r$ lies at frequencies $\|\omega\| = r$ by construction. However, after multiplying by $h$, the energy spills over to adjacent frequencies as well, and we need to estimate the amount of the spill.
Due to the fast decay of $\hat h$ we will show, roughly, the following estimates on the behavior of $\hat f_r$:
$$|\hat f_r(\omega)| \lesssim \tilde O\big(r^{(1-d)/2}\big)\,1\big\{\big|\|\omega\| - r\big| \le \log^2(r)\big\} + r^{(d-1)/2}\,H\big(\max\big(\big|\|\omega\| - r\big|,\ \log^2 r\big)\big)$$
and
$$|\hat f_r(\omega)| \lesssim r^{(d-1)/2}\,\|\omega\|$$
as $r\to\infty$. Note that the first bound above is super-polynomially decaying in both $\big|\|\omega\| - r\big|$ and $r$, which allows us to show that $\|f_r\|_{t,2} = \tilde O(r^t)$ for $t > -\frac{d+2}{2}$, recalling the notation $\|f\|_{t,2} = \big\|\,\|\omega\|^t\hat f\,\big\|_2$. A direct calculation will also show $\|f_r\|_1 \asymp \|f_r\|_2 \asymp \|f_r\|_\infty \asymp 1$. For a desired total-variation separation $\epsilon$, we will set $\mu - \nu = \epsilon f_r$ and choose $r \asymp \epsilon^{-1/\beta}$ to ensure that $\epsilon\,\|f_r\|_{\beta,2} = \tilde O(1)$. For the energy distance between $\mu$ and $\nu$, these choices yield
$$\mathcal E_\gamma(\mu, \nu) \asymp \epsilon\,\|f_{\epsilon^{-1/\beta}}\|_{-\frac{d+\gamma}{2},2} = \tilde O\big(\epsilon^{1+\frac{d+\gamma}{2\beta}}\big) = \tilde O\big(\mathrm{TV}^{1+\frac{d+\gamma}{2\beta}}\big),$$
as required. We now proceed to the rigorous details.

E.2.2 THE CONSTRUCTION

First, we must construct the mollifier $h$ with the properties outlined in Section E.2.1. Recall that a function $f$ is radial (also known as spherically symmetric) if its value at $x \in \mathbb{R}^d$ depends only on $\|x\|$; in other words, $f(x) = f(y)$ holds for all $x, y \in \mathbb{R}^d$ with $\|x\| = \|y\|$.

Lemma 37. There exist a compactly supported radial Schwartz function $h$ and a positive sequence $\{r_n\}_{n=1}^\infty$ satisfying $r_n = \Theta(n)$, such that
$$\mathrm{supp}(h) \subseteq B(0, 1), \tag{43}$$
$$\mathrm{supp}(h) \subseteq \mathbb{R}^d\setminus B(0, r_0), \tag{44}$$
$$|\hat h(w)| \le C\exp\Big(-\frac{c\,\|w\|}{\log(\|w\|+2)^2}\Big)\ \text{ for all } w\in\mathbb{R}^d,\ \text{and} \tag{45}$$
$$\hat h(r_n u) = 0\ \text{ for all } u\in S^{d-1}, \tag{46}$$
for universal constants $C, c, r_0 > 0$.

Proof. First, let $h_0$ be as constructed in Lemma 35, which already satisfies (43) and (45). To address the other two requirements, we modify $h_0$ by convolving it with two additional factors:
$$h(x) := \big(A_0(\cdot) * h_0(8\,\cdot) * \rho_0(\cdot)\big)(x),$$
where $A_0$ and $\rho_0$ aim to address (44) and (46), respectively, and are defined as
$$A_0(x) = \exp\Big(-\frac{1}{1/64 - (\|x\| - 1/2)^2}\Big)\,1\big\{\|x\|\in(3/8, 5/8)\big\}, \qquad \rho_0(x) = 1\{\|x\| < 1/8\}.$$
Before proceeding, note that $h$ is clearly a radial Schwartz function. Let us now verify that $h$ indeed satisfies the four requirements. Note that $A_0$ is supported on the annulus $B(0, 5/8)\setminus B(0, 3/8)$, and both $h_0(8\,\cdot)$ and $\rho_0$ are supported on $B(0, 1/8)$.
Therefore, $\mathrm{supp}(h) \subseteq B(0, 7/8)\setminus B(0, 1/8)$, which implies (43) and (44). We now turn to the other two conditions, in Fourier space. Note that $\hat h(w) = (1/8)^d\,\hat A_0(w)\,\hat h_0(w/8)\,\hat\rho_0(w)$. From Item 3 of Lemma 34 we know that
$$\mathcal F\big[1\{\|\cdot\| < 1\}\big](w) = \frac{(2\pi)^{d/2}}{\|w\|^{d/2}}\,J_{d/2}(\|w\|).$$
Hence, by Item 2 of Lemma 34, the function $\hat\rho_0(w) = (1/8)^d\,\mathcal F[1\{\|\cdot\| < 1\}](w/8)$ has infinitely many zeros, located near the values $\|w\| = 8\big(2n\pi + \frac{(d+1)\pi}{4}\big)$ for sufficiently large $n \in \mathbb{Z}_+$, which implies (46). Finally, for (45), note that since both $A_0$ and $\rho_0$ are integrable, their Fourier transforms $\hat A_0$ and $\hat\rho_0$ are bounded, so that
$$|\hat h(w)| \le (1/8)^d\,\|\hat A_0\|_\infty\,\|\hat\rho_0\|_\infty\,|\hat h_0(w/8)| \lesssim |\hat h_0(w/8)|,$$
concluding the proof.

Let $h$ be as constructed in Lemma 37, and define
$$f_r = g_r\,h \tag{47}$$
for $r > 0$ and $\hat g_r(\omega) = r^{(1-d)/2}\,\delta(\|\omega\| - r)$. Recall from the overview of our construction in Section E.2.1 that $f_r$ is our proposed density difference, which we claim (approximately) saturates Hölder's inequality in (22). The next lemma records the properties of $f_r$ that will enable us to complete our proof.

Lemma 38. Let $f_r$ be as in (47) and let $\{r_n\}_{n=1}^\infty$ be the sequence constructed in Lemma 37. The following hold.
(i) For all $n \in \mathbb{N}$ we have $\int_{\mathbb{R}^d} f_{r_n}(x)\,dx = 0$ and $\mathrm{supp}(f_{r_n}) \subseteq B(0, 1)$.
(ii) We have $\|f_{r_n}\|_\infty \asymp \|f_{r_n}\|_2 \asymp \|f_{r_n}\|_1 \asymp 1$, hiding constants independent of $n$.
(iii) For any $t > -\frac{d+2}{2}$ we have $\|f_{r_n}\|_{t,2} = O\big(r_n^t\,\log^{2d}(r_n)\big)$ as $n\to\infty$, hiding constants independent of $n$.
(iv) Recall the definition of $\psi_\gamma$ from (17). For any $\gamma\in(0,2)$ we have
$$\sup_{v\in S^{d-1},\,b\in\mathbb{R}}\Big|\int_{\mathbb{R}^d}\psi_\gamma(\langle x, v\rangle - b)\,f_{r_n}(x)\,dx\Big| = O\big(r_n^{-(d+\gamma)/2}\,\log^{2d}(r_n)\big),$$
hiding constants independent of $n$.

Proof. Let us drop the subscript of $r_n$ to simplify notation.

Showing (i). Note that $\int_{\mathbb{R}^d}f_r(x)\,dx = \hat f_r(0)$. Then $\hat f_r(0) = 0$ follows from the construction of $h$ and $g_r$: indeed, $\hat g_r$ is supported on $r\,S^{d-1}$ while $\hat h|_{r\,S^{d-1}} \equiv 0$. The fact that $\mathrm{supp}(f_r) \subseteq B(0, 1)$ follows from $\mathrm{supp}(h) \subseteq B(0, 1)$.

Showing (ii). Since $f_r$ has compact support, we immediately have $\|f_r\|_1 \lesssim \|f_r\|_2 \lesssim \|f_r\|_\infty$.
As $h$ is continuous and supported on the annulus $\{x : r_0 \le \|x\| \le 1\}$ by construction, it suffices to bound $g_r$ on said annulus. Now, for any $x$ with $r_0 \le \|x\| \le 1$, we have by Lemma 34 that
$$|g_r(x)| \lesssim r^{(d-1)/2}\,(r\|x\|)^{1-d/2}\,\frac{1}{\sqrt{r\|x\|}} \lesssim \|x\|^{(1-d)/2} \lesssim 1,$$
which shows that $\|f_r\|_\infty \lesssim 1$. We now turn to lower bounding $\|f_r\|_1$. Recall that $h$ is uniformly continuous and nontrivial, hence $\int_{S^{d-1}}|h(uv)|\,d\sigma(v) \neq 0$ for some radius $u$, and thus for all $u\in(u_0, u_1) \subseteq (0, 1)$ for some constants $u_0, u_1$. Using that $g_r$ is spherically symmetric, we compute
$$\|f_r\|_1 = \int_{\mathbb{R}^d}|g_r(x)|\,|h(x)|\,dx = \int_0^\infty u^{d-1}\,\big|g_r(u, 0, \dots, 0)\big|\int_{S^{d-1}}|h(uv)|\,d\sigma(v)\,du \gtrsim \int_{u_0}^{u_1} u^{d-1}\,r^{(d-1)/2}\,(ru)^{1-d/2}\,\big|J_{d/2-1}(ru)\big|\,du \gtrsim 1,$$
where the last step follows by (35) once again.

Showing (iii). Let $0 < s < r$, whose precise value will be set later. For convenience, set $B_s = \{x\in\mathbb{R}^d : \|x\| \le s\}$ and $B_s^c = \mathbb{R}^d\setminus B_s$. Recall that by definition
$$\hat f_r(\omega) = r^{(1-d)/2}\int_{\mathbb{R}^d}\hat h(\omega + x)\,\delta(\|x\| - r)\,dx = \underbrace{r^{(1-d)/2}\int_{\mathbb{R}^d}(\hat h\,1_{B_s})(\omega + x)\,\delta(\|x\| - r)\,dx}_{I} + \underbrace{r^{(1-d)/2}\int_{\mathbb{R}^d}(\hat h\,1_{B_s^c})(\omega + x)\,\delta(\|x\| - r)\,dx}_{II}. \tag{48}$$
Let $C, c$ be as in Lemma 37, and $H(x) = C\exp(-cx/\log^2(x+2))$. Note that $\|\hat h\|_\infty \le C$. Therefore, the first term in the decomposition (48) can be bounded by
$$|I| \le C\,r^{(1-d)/2}\int_{\mathbb{R}^d}1\{\|\omega + x\| \le s\}\,\delta(\|x\| - r)\,dx = C\,r^{(1-d)/2}\,1\big\{\|\omega\|\in[r-s, r+s]\big\}\int_{\mathbb{R}^d}1\{\|\omega+x\|\le s\}\,\delta(\|x\|-r)\,dx \lesssim r^{(1-d)/2}\,s^{d-1}\,1\big\{\|\omega\|\in[r-s, r+s]\big\}.$$
The second step uses that if $\|\omega\| \notin [r-s, r+s]$ then the integral vanishes. The third step uses the fact that the surface area of the intersection of $B_s$ with any sphere of any radius (and, in particular, the one centered at $\omega$ with radius $r$) is at most $O(s^{d-1})$. Moving on to the second term, we have
$$|II| = r^{(d-1)/2}\,\Big|\int_{S^{d-1}}(\hat h\,1_{B_s^c})(\omega + ru)\,d\sigma(u)\Big| \lesssim r^{(d-1)/2}\,H\big(\max\{|\|\omega\| - r|,\ s\}\big),$$
using that $H : [0, \infty)\to(0, C]$ is decreasing and that $|\hat h(y)\,1_{B_s^c}(y)| \le H(\max\{\|y\|, s\})$ for all $y\in\mathbb{R}^d$. Summarizing, we have the pointwise estimate
$$|\hat f_r(\omega)| \lesssim r^{(1-d)/2}\,s^{d-1}\,1\big\{\|\omega\|\in[r-s, r+s]\big\} + r^{(d-1)/2}\,H\big(\max\{|\|\omega\| - r|,\ s\}\big) \tag{49}$$
for all $\omega\in\mathbb{R}^d$ and $0 < s < r$. We next show that $\hat f_r$ vanishes linearly at the origin. Recall from the construction of $h$ (Lemma 37) that $\hat h|_{r_n S^{d-1}} \equiv 0$.
Then, we observe that for any $\omega\in\mathbb{R}^d$,
$$|\hat f_r(\omega)| = r^{(d-1)/2}\,\Big|\int_{S^{d-1}}\hat h(\omega + ru)\,d\sigma(u)\Big| = r^{(d-1)/2}\,\Big|\int_{S^{d-1}}\big\{\hat h(\omega + ru) - \hat h(ru)\big\}\,d\sigma(u)\Big| \le r^{(d-1)/2}\,\|\hat h\|_{\mathrm{Lip}}\,\frac{2\pi^{d/2}}{\Gamma(d/2)}\,\|\omega\| \lesssim r^{(d-1)/2}\,\|\omega\|, \tag{50}$$
where we use that $\hat h$ is Schwartz by construction, and thus has finite Lipschitz constant $\|\hat h\|_{\mathrm{Lip}}$.

With (49) and (50) in hand, we can proceed to bounding the norm of $f_r$. Let $s = D\log^2(r)$ for a large constant $D$ independent of $r$, and assume that $r$ is large enough so that $s < r/2$. Also fix $\theta > 0$, whose precise value is specified later. We have
$$\|f_r\|_{t,2}^2 = \int_{\mathbb{R}^d}\|\omega\|^{2t}\,|\hat f_r(\omega)|^2\,d\omega \overset{(50)}{\lesssim} r^{d-1}\int_{\|\omega\|\le r^{-\theta}}\|\omega\|^{2t+2}\,d\omega + \int_{\|\omega\|>r^{-\theta}}\|\omega\|^{2t}\,|\hat f_r(\omega)|^2\,d\omega$$
$$\overset{(49)}{\lesssim} r^{d-1-\theta(2t+d+2)} + r^{1-d}\,s^{2(d-1)}\int_{\|\omega\|>r^{-\theta}}\|\omega\|^{2t}\,1\big\{\|\omega\|\in[r-s, r+s]\big\}\,d\omega + \int_{\|\omega\|>r^{-\theta}}\|\omega\|^{2t}\,r^{d-1}\,H^2\big(\max\{|\|\omega\| - r|,\ s\}\big)\,d\omega$$
$$\lesssim r^{d-1-\theta(2t+d+2)} + r^{2t}\,s^{2d-1} + r^{d-1}\int_{r^{-\theta}}^\infty u^{2t+d-1}\,H^2\big(\max\{|u - r|,\ s\}\big)\,du.$$
Note that in the derivation above we changed to polar coordinates freely, and that in the second inequality we used the assumption $t > -\frac{d+2}{2}$. Setting $\theta$ to any positive value greater than $(d - 1 - 2t)/(2t + d + 2)$ ensures that the first term in the final line is $O(r^{2t})$, while $s^{2d-1} = O(\log^{4d-2}(r))$. As for the integral term, we can bound it by
$$r^{d-1}\,H^2(s)\int_{r^{-\theta}}^{2r}u^{2t+d-1}\,du + r^{d-1}\int_{2r}^\infty u^{2t+d-1}\,H^2(u/2)\,du \overset{\text{Lemma 33}}{\lesssim} \mathrm{poly}(r)\,H^2(s) + r^{2t}.$$
By taking $D$ large enough (independently of $r$) in the definition of $s = D\log^2(r)$, we can also make the first term $\mathrm{poly}(r)\,H^2(s)$ less than $O(r^{2t})$, which concludes the proof of (iii).

Showing (iv). The bounds that we develop below are analogous to those given in the proof of (iii). Fix $b\in\mathbb{R}$ and $v\in S^{d-1}$ and define
$$\Delta := \int_{\mathbb{R}^d}\psi_\gamma(\langle v, x\rangle - b)\,f_r(x)\,dx.$$
Suppose first that $\gamma \neq 1$. Then, using Lemmas 28 and 29, we know by dominated convergence that
$$\Delta = C_{\psi_\gamma}\lim_{\epsilon\to0}\int_{\mathbb{R}^d}\int_{A_\epsilon}\frac{\cos\big(t(\langle v, x\rangle - b)\big) - 1\{\gamma > 1\}}{t^{(1+\gamma)/2}}\,f_r(x)\,dt\,dx = C_{\psi_\gamma}\lim_{\epsilon\to0}\,\Re\int_{A_\epsilon}\int_{\mathbb{R}^d}\frac{e^{it(\langle v, x\rangle - b)}\,f_r(x)}{t^{(1+\gamma)/2}}\,dx\,dt = C_{\psi_\gamma}\lim_{\epsilon\to0}\int_{A_\epsilon}\frac{\cos(tb)\,\hat f_r(tv)}{t^{(1+\gamma)/2}}\,dt,$$
where the $1\{\gamma>1\}$ term vanishes because $\int f_r = 0$, and the last step uses that $\hat f_r$ is real-valued and radial. Similarly, for $\gamma = 1$ we can compute
$$\Delta = C_{\psi_1}\lim_{\epsilon\to0}\int_{A_\epsilon}\frac{\sin(-tb)\,\hat f_r(tv)}{t}\,dt.$$
In either case, we have $|\Delta| \lesssim \int_0^\infty|\hat f_r(tv)|\,t^{-(1+\gamma)/2}\,dt$. Let $s = D\log^2(r)$ for large $D$ independent of $r$, as in the proof of (iii), and let $\theta > 0$, whose precise value is specified later.
Assuming that $r$ is large enough so that $s < r/2$, for any $\gamma\in(0,2)$ we have
$$\int_0^\infty\frac{|\hat f_r(tv)|}{t^{(1+\gamma)/2}}\,dt = \int_0^{r^{-\theta}}\frac{|\hat f_r(tv)|}{t^{(1+\gamma)/2}}\,dt + \int_{r^{-\theta}}^\infty\frac{|\hat f_r(tv)|}{t^{(1+\gamma)/2}}\,dt$$
$$\overset{(50)}{\lesssim} r^{(d-1)/2}\int_0^{r^{-\theta}}t^{(1-\gamma)/2}\,dt + \int_{r^{-\theta}}^\infty\frac{1}{t^{(1+\gamma)/2}}\Big(r^{(1-d)/2}\,s^{d-1}\,1\{t\in[r-s, r+s]\} + r^{(d-1)/2}\,H\big(\max\{|t - r|,\ s\}\big)\Big)\,dt$$
$$\lesssim r^{\frac{d-1}{2}-\theta\frac{3-\gamma}{2}} + r^{-(d+\gamma)/2}\,\log^{2d}(r) + H(s)\,r^{(d-1)/2}\int_{r^{-\theta}}^{2r}\frac{dt}{t^{(1+\gamma)/2}} + r^{(d-1)/2}\int_{2r}^\infty\frac{H(t/2)}{t^{(1+\gamma)/2}}\,dt \overset{\text{Lemma 33}}{\lesssim} r^{\frac{d-1}{2}-\theta\frac{3-\gamma}{2}} + r^{-(d+\gamma)/2}\,\log^{2d}(r) + H(s)\,\mathrm{poly}(r) + r^{-100d}. \tag{51}$$
Set $\theta$ to any value greater than $(2d + \gamma - 1)/(3 - \gamma)$, which ensures that the first term in (51) is $O(r^{-(d+\gamma)/2})$. By taking $D$ large enough in the definition of $s = D\log^2(r)$, we can make $H(s)$ smaller than any polynomial in $r$, which ensures that the third term in (51) is also $O(r^{-(d+\gamma)/2})$. We thus obtain the final bound $|\Delta| \lesssim r^{-(d+\gamma)/2}\log^{2d}(r)$, concluding our proof.

Proof [Proof of Theorem 12 and Proposition 13 for $d > 1$]. Using the functions $\{f_{r_n}\}_{n=1}^\infty$ constructed in Lemma 38, we are ready to prove Theorem 12 and Proposition 13 for $d > 1$. Let $p_0$ be a compactly supported probability density with $\inf_{\|x\|\le1}p_0(x) > 0$. Fix the smoothness $\beta > 0$. Given any desired total variation separation $\epsilon\in(0,1)$, we can find $n_0\in\mathbb{N}$ such that $\epsilon^{-1/\beta} \asymp r_{n_0}$, where we hide an $\epsilon$-independent multiplicative constant. Define
$$p_\epsilon = p_0 + \tfrac{\epsilon}{2}\,f_{r_{n_0}} \qquad\text{and}\qquad q_\epsilon = p_0 - \tfrac{\epsilon}{2}\,f_{r_{n_0}}.$$
Clearly $p_\epsilon$ and $q_\epsilon$ are compactly supported probability densities for all small enough $\epsilon$. Moreover, by Lemma 38 they satisfy $\|p_\epsilon - q_\epsilon\|_1 \asymp \epsilon$ and $\|p_\epsilon\|_{\beta,2}\vee\|q_\epsilon\|_{\beta,2} \lesssim 1$ and
$$\mathcal E_\gamma(p_\epsilon, q_\epsilon) \lesssim \epsilon^{1+\frac{d+\gamma}{2\beta}}\,\log(1/\epsilon)^{2d} \qquad\text{and}\qquad d_H(p_\epsilon, q_\epsilon) \lesssim \epsilon^{1+\frac{d+1}{2\beta}}\,\log(1/\epsilon)^{2d}$$
for all fixed $\gamma\in(0,2)$. This concludes our proof.

Appendix F. Proof of Proposition 24

Proof. Note that $d_H = T_{d,0}$. Let $p_\epsilon, q_\epsilon$ be the compactly supported densities constructed in the proof of Theorem 12 in the general-dimension case. Then by construction
$$\epsilon \asymp \mathrm{TV}(p_\epsilon, q_\epsilon) \asymp \|p_\epsilon - q_\epsilon\|_2, \qquad \|p_\epsilon\|_{\beta,2} + \|q_\epsilon\|_{\beta,2} \lesssim 1, \qquad d_H(p_\epsilon, q_\epsilon) \lesssim \epsilon^{1+\frac{d+1}{2\beta}}\,\log(1/\epsilon)^{2d}.$$
Write $p_{\epsilon,n}$ and $q_{\epsilon,n}$ for the empirical measures of $p_\epsilon$ and $q_\epsilon$ respectively, based on $n$ i.i.d. observations each.
By the triangle inequality we have
\[
\mathbb{E}\, d_H(p_{\epsilon,n}, q_{\epsilon,n}) \le \mathbb{E}\, d_H(p_{\epsilon,n}, p_\epsilon) + d_H(p_\epsilon, q_\epsilon) + \mathbb{E}\, d_H(q_\epsilon, q_{\epsilon,n}) \overset{\text{Lemma 20}}{\lesssim} \frac{1}{\sqrt{n}} + \epsilon^{2\beta} \log(1/\epsilon)^d.
\]
This completes the proof.

References

Ery Arias-Castro, Bruno Pelletier, and Venkatesh Saligrama. Remember the curse of dimensionality: The case of goodness-of-fit testing in arbitrary dimension. Journal of Nonparametric Statistics, 30(2):448–471, 2018.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.

Yu Bai, Tengyu Ma, and Andrej Risteski. Approximability of discriminators implies diversity in GANs. arXiv preprint arXiv:1806.10586, 2018.

Keith Ball. Eigenvalues of Euclidean distance matrices. Journal of Approximation Theory, 68(1):74–82, 1992.

Alexander Barg and G. David Forney. Random codes: Minimum distances and error exponents. IEEE Transactions on Information Theory, 48(9):2568–2573, 2002.

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International Conference on Machine Learning, pages 531–540. PMLR, 2018.

Denis Belomestny, Eric Moulines, Alexey Naumov, Nikita Puchkin, and Sergey Samsonov. Rates of convergence for density estimation with generative adversarial networks. arXiv preprint arXiv:2102.00199, 2021.

Denis Belomestny, Alexey Naumov, Nikita Puchkin, and Sergey Samsonov. Simultaneous approximation of a smooth function and its derivatives by deep neural networks with piecewise-polynomial activations. Neural Networks, 161:242–253, 2023.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Clément L. Canonne. A short note on learning discrete distributions. arXiv preprint arXiv:2002.11457, 2020.

Minwoo Chae. Rates of convergence for nonparametric estimation of singular distributions using generative adversarial networks. arXiv preprint arXiv:2202.02890, 2022.

Minwoo Chae. Wasserstein upper bounds of $L^p$-norms for multivariate densities in Besov spaces. Statistics & Probability Letters, 210:110131, 2024.

Minwoo Chae and Stephen G. Walker. Wasserstein upper bounds of the total variation for smooth densities. Statistics & Probability Letters, 163:108771, 2020. ISSN 0167-7152. doi: https://doi.org/10.1016/j.spl.2020.108771. URL https://www.sciencedirect.com/science/article/pii/S0167715220300742.

Minwoo Chae, Dongha Kim, Yongdai Kim, and Lizhen Lin. A likelihood approach to nonparametric estimation of a singular distribution using deep generative models. Journal of Machine Learning Research, 24(77):1–42, 2023.

Minshuo Chen, Wenjing Liao, Hongyuan Zha, and Tuo Zhao. Distribution approximation and statistical estimation guarantees of generative adversarial networks. arXiv preprint arXiv:2002.03938, 2020.

Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022.

Alex Cohen. Fractal uncertainty in higher dimensions. arXiv preprint arXiv:2305.05022, 2023.

Vincent Divol. Measure estimation on manifolds: an optimal transport approach. Probability Theory and Related Fields, 183(1):581–647, 2022.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Ildar A. Ibragimov and Rafail Z. Khasminskii. Estimation of distribution density. Journal of Soviet Mathematics, 21:40–57, 1983.

Yu. I. Ingster. Minimax testing of nonparametric hypotheses on a distribution density in the $L_p$ metrics. Theory of Probability & Its Applications, 31(2):333–337, 1987. doi: 10.1137/1131042.

Yuri Ingster and Irina A. Suslina. Nonparametric Goodness-of-Fit Testing Under Gaussian Models, volume 169. Springer Science & Business Media, 2003.

Zeyu Jia, Yury Polyanskiy, and Yihong Wu. Entropic characterization of optimal rates for learning Gaussian mixtures. arXiv preprint arXiv:2306.12308, 2023.

Jørn Justesen. Class of constructive asymptotically good algebraic codes. IEEE Transactions on Information Theory, 18(5):652–656, 1972.

Arlene K. H. Kim and Adityanand Guntuboyina. Minimax bounds for estimating multivariate Gaussian location mixtures. Electronic Journal of Statistics, 16(1):1461–1484, 2022.

Diederik P. Kingma and Max Welling. An introduction to variational autoencoders. Foundations and Trends in Machine Learning, 12(4):307–392, 2019.

Szymon Knop, Jacek Tabor, Przemyslaw Spurek, Igor Podolak, Marcin Mazur, and Stanislaw Jastrzebski. Cramer-Wold autoencoder. 2018. doi: 10.48550/ARXIV.1805.09235.

Soheil Kolouri, Kimia Nadjahi, Shahin Shahrampour, and Umut Simsekli. Generalized sliced probability metrics. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4513–4517, 2022. doi: 10.1109/ICASSP43922.2022.9746016.

Naum S. Landkof. Foundations of Modern Potential Theory, volume 180. Springer, 1972.

Tong Li and Ming Yuan. On the optimality of Gaussian kernel based nonparametric tests against smooth alternatives. arXiv preprint arXiv:1909.03302, 2019.

Tengyuan Liang. How well generative adversarial networks learn distributions. The Journal of Machine Learning Research, 22(1):10366–10406, 2021.

Subhash Chandra Malik and Savita Arora. Mathematical Analysis. New Age International, 1992.

Youssef Marzouk, Zhi Ren, Sven Wang, and Jakob Zech. Distribution learning via neural differential equations: a nonparametric statistical perspective. arXiv preprint arXiv:2309.01043, 2023.

Henryk Minc and Leroy Sathre. Some inequalities involving $(r!)^{1/r}$. Proceedings of the Edinburgh Mathematical Society, 14(1):41–46, 1964.

Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.

Kimia Nadjahi, Alain Durmus, Lénaïc Chizat, Soheil Kolouri, Shahin Shahrampour, and Umut Simsekli. Statistical and topological properties of sliced probability divergences. Advances in Neural Information Processing Systems, 33:20802–20812, 2020.

Francis J. Narcowich and Joseph D. Ward. Norm estimates for the inverses of a general class of scattered-data radial-function interpolation matrices. Journal of Approximation Theory, 69(1):84–109, 1992.

Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diffusion models are minimax optimal distribution estimators. arXiv preprint arXiv:2303.01861, 2023.

Seunghoon Paik, Michael Celentano, Alden Green, and Ryan J. Tibshirani. Maximum mean discrepancy meets neural networks: The Radon-Kolmogorov-Smirnov test. arXiv preprint arXiv:2309.02422, 2023.

Yury Polyanskiy and Yihong Wu. Information Theory: From Coding to Learning. Cambridge University Press, 2023+.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2001. ISBN 9780262256933. doi: 10.7551/mitpress/4175.001.0001.

Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, pages 2263–2291, 2013.

Shashank Singh, Ananya Uppal, Boyue Li, Chun-Liang Li, Manzil Zaheer, and Barnabás Póczos. Nonparametric density estimation under adversarial losses. Advances in Neural Information Processing Systems, 31, 2018.

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

Elias M. Stein and Guido Weiss. Introduction to Fourier Analysis on Euclidean Spaces, volume 1. Princeton University Press, 1971.

Arthur Stéphanovitch, Eddie Aamari, and Clément Levrard. Wasserstein GANs are minimax optimal distribution estimators. arXiv preprint arXiv:2311.18613, 2023.

Gábor J. Székely and Maria L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013.

Rong Tang and Yun Yang. Minimax rate of distribution estimation on unknown submanifolds under adversarial losses. The Annals of Statistics, 51(3):1282–1308, 2023.

Stefan Tiegel. Hardness of agnostically learning halfspaces from worst-case lattice problems. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 3029–3064. PMLR, 2023. URL https://proceedings.mlr.press/v195/tiegel23a.html.

Ananya Uppal, Shashank Singh, and Barnabás Póczos. Nonparametric density estimation & convergence rates for GANs under Besov IPM losses. Advances in Neural Information Processing Systems, 32, 2019.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

C. Villani. Topics in Optimal Transportation. Graduate Studies in Mathematics. American Mathematical Society, 2003. ISBN 9780821833124. URL https://books.google.com/books?id=idyFAwAAQBAJ.

Sven Wang and Youssef Marzouk. On minimax density estimation via measure transport. arXiv preprint arXiv:2207.10231, 2022.

George N. Watson. A Treatise on the Theory of Bessel Functions. Cambridge University Press, 1980. ISBN 052106743X.

Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.

Yannis G. Yatracos. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, 13(2):768–774, 1985.

Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):2008–2026, 2018.