# Nonparametric Density Estimation under Adversarial Losses

Shashank Singh¹,², Ananya Uppal³, Boyue Li⁴, Chun-Liang Li¹, Manzil Zaheer¹, Barnabás Póczos¹
¹Machine Learning Department, ²Department of Statistics & Data Science, ³Department of Mathematical Sciences, ⁴Language Technologies Institute, Carnegie Mellon University
Corresponding Author: sss1@cs.cmu.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We study minimax convergence rates of nonparametric density estimation under a large class of loss functions called adversarial losses, which, besides classical $L^p$ losses, includes maximum mean discrepancy (MMD), Wasserstein distance, and total variation distance. These losses are closely related to the losses encoded by discriminator networks in generative adversarial networks (GANs). In a general framework, we study how the choice of loss and the assumed smoothness of the underlying density together determine the minimax rate. We also discuss implications for training GANs based on deep ReLU networks, and more general connections to learning implicit generative models in a minimax statistical sense.

## 1 Introduction

Generative modeling, that is, modeling the distribution from which data are drawn, is a central task in machine learning and statistics. Often, prior information is insufficient to guess the form of the data distribution. In statistics, generative modeling in these settings is usually studied from the perspective of nonparametric density estimation, in which histogram, kernel, orthogonal series, and nearest-neighbor methods are popular approaches with well-understood statistical properties [49, 47, 14, 7].

Recently, machine learning has made significant empirical progress in generative modeling, using such tools as generative adversarial networks (GANs) and variational autoencoders (VAEs). Computationally, these methods are quite distinct from classical density estimators; they usually rely on deep neural networks, fit by black-box optimization, rather than a mathematically prescribed smoothing operator, such as convolution with a kernel or projection onto a finite-dimensional subspace.

Ignoring the implementation of these models, from the perspective of statistical analysis, these recent methods have at least two main differences from classical density estimators. First, they are implicit, rather than explicit (or prescriptive) generative models [12, 29]; that is, rather than an estimate of the probability of a set or the density at a point, they return novel samples from the data distribution. Second, in many recent models, loss is measured not with $L^p$ distances (as is conventional in nonparametric statistics [49, 47]), but rather with weaker losses, such as

$$d_{\mathcal{F}_D}(P, Q) = \sup_{f \in \mathcal{F}_D} \left| \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{X \sim Q}[f(X)] \right|, \tag{1}$$

where $\mathcal{F}_D$ is a discriminator class of bounded, Borel-measurable functions, and $P$ and $Q$ lie in a generator class $\mathcal{F}_G$ of Borel probability measures on a sample space $\mathcal{X}$. Specifically, GANs often use losses of this form because (1) can be approximated by a discriminator neural network.
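As a toy illustration of (1) (not from the paper), the following sketch replaces the supremum over $\mathcal{F}_D$ with a maximum over a small, hand-picked family of functions and the expectations with sample means; the distributions, feature choices, and function names are arbitrary assumptions made here for concreteness.

```python
import numpy as np

def adversarial_loss(samples_p, samples_q, discriminators):
    """Monte Carlo estimate of d_{F_D}(P, Q) = sup_{f in F_D} |E_P[f] - E_Q[f]|,
    with the supremum replaced by a maximum over a finite list of functions."""
    gaps = [abs(np.mean(f(samples_p)) - np.mean(f(samples_q)))
            for f in discriminators]
    return max(gaps)

rng = np.random.default_rng(0)
x_p = rng.uniform(0.0, 1.0, size=5000)           # samples from P = Uniform[0, 1]
x_q = rng.beta(2.0, 2.0, size=5000)              # samples from Q = Beta(2, 2)

# A crude "discriminator class": the constant function plus a few Fourier features.
weak_class = [lambda x: np.ones_like(x)]
richer_class = weak_class + [
    (lambda k: (lambda x: np.cos(2 * np.pi * k * x)))(k) for k in range(1, 6)
] + [
    (lambda k: (lambda x: np.sin(2 * np.pi * k * x)))(k) for k in range(1, 6)
]

print(adversarial_loss(x_p, x_q, weak_class))    # 0.0: the constant cannot separate P and Q
print(adversarial_loss(x_p, x_q, richer_class))  # > 0: the richer class detects the difference
```

With only the constant function as a discriminator, the estimated loss is exactly zero even though $P \neq Q$; adding a few Fourier features makes the difference visible, previewing the observation in Section 1.1 that $d_{\mathcal{F}_D}$ is in general only a pseudometric.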
This paper attempts to help bridge the gap between traditional nonparametric statistics and these recent advances by studying these two differences from a statistical minimax perspective. Specifically, under traditional statistical smoothness assumptions, we identify (i.e., prove matching upper and lower bounds on) minimax convergence rates for density estimation under several losses of the form (1). We also discuss some consequences this has for particular neural network implementations of GANs based on these losses. Finally, we study connections between minimax rates for explicit and implicit generative modeling, under a plausible notion of risk for implicit generative models.

### 1.1 Adversarial Losses

The quantity (1) has been extensively studied, in the case that $\mathcal{F}_D$ is a reproducing kernel Hilbert space (RKHS), under the name maximum mean discrepancy (MMD; [18, 46]), and, in a wider context, under the name integral probability metric (IPM; [31, 42, 43, 8]). [5] also called (1) the $\mathcal{F}_D$-distance, or, when $\mathcal{F}_D$ is a family of functions that can be implemented by a neural network, the neural network distance. We settled on the name adversarial loss because, without assuming any structure on $\mathcal{F}_D$, this matches the intuition of the expression (1), namely that of an adversary selecting the most distinguishing linear projection $f \in \mathcal{F}_D$ between the true density $P$ and our estimate $\hat{P}$ (e.g., by the discriminator network in a GAN).

One can check that $d_{\mathcal{F}_D} : \mathcal{F}_G \times \mathcal{F}_G \to [0, \infty]$ is a pseudometric (i.e., it is non-negative and satisfies the triangle inequality, and $d_{\mathcal{F}_D}(P, Q) > 0$ implies $P \neq Q$, although $d_{\mathcal{F}_D}(P, Q) = 0$ need not imply $P = Q$ unless $\mathcal{F}_D$ is sufficiently rich). Many popular (pseudo)metrics between probability distributions, including $L^p$ [49, 47], Sobolev [24, 30], maximum mean discrepancy (MMD; [46])/energy [45, 36], total variation [48], (1-)Wasserstein/Kantorovich-Rubinstein [20, 48], Kolmogorov-Smirnov [22, 40], and Dudley [13, 1] metrics can be written in this form, for appropriate choices of $\mathcal{F}_D$.

The main contribution of this paper is a statistical analysis of the problem of estimating a distribution $P$ from $n$ IID observations using the loss $d_{\mathcal{F}_D}$, in a minimax sense over $P \in \mathcal{F}_G$, for fairly general nonparametric smoothness classes $\mathcal{F}_D$ and $\mathcal{F}_G$. General upper and lower bounds are given in terms of decay rates of coefficients of functions in an (arbitrary) orthonormal basis of $L^2$ (including, e.g., Fourier or wavelet bases); note that this does not require $\mathcal{F}_D$ or $\mathcal{F}_G$ to have any inner product structure, only that $\mathcal{F}_D \subseteq L^1$. We also discuss some consequences for density estimators based on neural networks (such as GANs), and consequences for the closely related problem of implicit generative modeling (i.e., of generating novel samples from a target distribution, rather than estimating the distribution itself), in terms of which GANs and VAEs are usually cast.

Paper Organization: Section 2 provides our formal problem statement and required notation. Section 3 discusses related work on nonparametric density estimation, with further discussion of the theory of GANs provided in the Appendix. Sections 4 and 5 contain our main theoretical upper and lower bound results, respectively. Section 6 develops our general results from Sections 4 and 5 into concrete minimax convergence rates for some important special cases. Section 7 uses our theoretical results to upper bound the error of perfectly optimized GANs. Section 8 establishes some theoretical relationships between the convergence of optimal density estimators and optimal implicit generative models.
The Appendix provides proofs of our theoretical results, further applications, further discussion of related and future work, and experiments on simulated data that support our theoretical results.

## 2 Problem Statement and Notation

We now provide a formal statement of the problem studied in this paper in a very general setting, and then define notation required for our specific results.

Formal Problem Statement: Let $P \in \mathcal{F}_G$ be an unknown probability measure on a sample space $\mathcal{X}$, from which we observe $n$ IID samples $X_{1:n} = (X_1, \dots, X_n) \overset{IID}{\sim} P$. In this paper, we are interested in using the samples $X_{1:n}$ to estimate the measure $P$, with error measured using the adversarial loss $d_{\mathcal{F}_D}$. Specifically, for various choices of spaces $\mathcal{F}_D$ and $\mathcal{F}_G$, we seek to bound the minimax rate

$$M(\mathcal{F}_D, \mathcal{F}_G) := \inf_{\hat{P}} \sup_{P \in \mathcal{F}_G} \mathbb{E}_{X_{1:n}} \left[ d_{\mathcal{F}_D}\left(P, \hat{P}(X_{1:n})\right) \right]$$

of estimating distributions assumed to lie in a class $\mathcal{F}_G$, where the infimum is taken over all estimators $\hat{P}$ (i.e., all (potentially randomized) functions $\hat{P} : \mathcal{X}^n \to \mathcal{F}_G$). We will discuss both the case when $\mathcal{F}_G$ is known a priori and the adaptive case when it is not.

### 2.1 Notation

For a non-negative integer $n$, we use $[n] := \{1, 2, \dots, n\}$ to denote the set of positive integers at most $n$. For sequences $\{a_n\}_{n \in \mathbb{N}}$ and $\{b_n\}_{n \in \mathbb{N}}$ of non-negative reals, $a_n \lesssim b_n$ (and, similarly, $b_n \gtrsim a_n$) indicates the existence of a constant $C > 0$ such that $\limsup_{n \to \infty} a_n / b_n \leq C$, while $a_n \asymp b_n$ indicates $a_n \lesssim b_n \lesssim a_n$. For functions $f : \mathbb{R}^d \to \mathbb{R}$, we write

$$\lim_{\|z\| \to \infty} f(z) := \sup_{\{z_n\}_{n \in \mathbb{N}} : \|z_n\| \to \infty} \lim_{n \to \infty} f(z_n),$$

where the supremum is taken over all diverging $\mathbb{R}^d$-valued sequences. Note that, by equivalence of finite-dimensional norms, the exact choice of the norm does not matter here. We will also require summations of the form $\sum_{z \in \mathcal{Z}} f(z)$ in cases where $\mathcal{Z}$ is a (potentially infinite) countable index set and $\{f(z)\}_{z \in \mathcal{Z}}$ is summable but not necessarily absolutely summable. Therefore, to ensure that the summation is well-defined, the order of summation will need to be specified, depending on the application (as in, e.g., Section 6).

Fix the sample space $\mathcal{X} = [0, 1]^d$ to be the $d$-dimensional unit cube, over which $\lambda$ denotes the usual Lebesgue measure. Given a measurable function $f : \mathcal{X} \to \mathbb{R}$, let, for any Borel measure $\mu$ on $\mathcal{X}$, $p \in [1, \infty]$, and $L > 0$,

$$\|f\|_{L^p_\mu} := \left( \int_{\mathcal{X}} |f|^p \, d\mu \right)^{1/p} \quad \text{and} \quad L^p_\mu(L) := \left\{ f : \mathcal{X} \to \mathbb{R} \,\middle|\, \|f\|_{L^p_\mu} < L \right\}$$

(taking the appropriate limit if $p = \infty$) denote the Lebesgue norm and ball of radius $L$, respectively. Fix an orthonormal basis $\mathcal{B} = \{\phi_z\}_{z \in \mathcal{Z}}$ of $L^2_\lambda$ indexed by a countable family $\mathcal{Z}$. To allow probability measures $P$ without densities (i.e., $P \not\ll \mu$), we assume each basis element $\phi_z : \mathcal{X} \to \mathbb{R}$ is a bounded function, so that $\tilde{P}_z := \mathbb{E}_{X \sim P}[\phi_z(X)]$ is well-defined. For constants $L > 0$ and $p \geq 1$ and a real-valued net $\{a_z\}_{z \in \mathcal{Z}}$, our results pertain to generalized ellipses of the form

$$\mathcal{H}_{p,a}(L) := \left\{ f : \left( \sum_{z \in \mathcal{Z}} a_z^p |\tilde{f}_z|^p \right)^{1/p} \leq L \right\}$$

(where $\tilde{f}_z := \int_{\mathcal{X}} f \phi_z \, d\mu$ is the $z$th coefficient of $f$ in the basis $\mathcal{B}$). We sometimes omit dependence on $L$ (e.g., $\mathcal{H}_{p,a} = \mathcal{H}_{p,a}(L)$) when its value does not matter (e.g., when discussing rates of convergence). A particular case of interest is the scale of Sobolev spaces, defined for $s, L \geq 0$ and $p \geq 1$ by

$$\mathcal{W}^{s,p}(L) := \left\{ f : \left( \sum_{z \in \mathcal{Z}} \|z\|^{sp} |\tilde{f}_z|^p \right)^{1/p} \leq L \right\}.$$

For example, when $\mathcal{B}$ is the standard Fourier basis and $s$ is an integer, for a constant factor $c$ depending only on $s$ and the dimension $d$, $\mathcal{W}^{s,p}(cL)$ corresponds to the natural standard smoothness class of $L^p_\lambda$ functions having $s$th-order (weak) derivatives $f^{(s)}$ in $L^p_\lambda(L)$ [24].
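To make the coefficient-based definitions above concrete, here is a small numerical sketch (mine, not the authors') in $d = 1$, using the rescaled Fourier basis $\phi_z(x) = e^{2\pi i z x}$ on $[0, 1]$ and approximating coefficients by Riemann sums; the helper names and test functions are assumptions for illustration only.

```python
import numpy as np

def fourier_coefficients(f, max_freq, grid_size=4096):
    """Approximate coefficients <f, phi_z> for phi_z(x) = exp(2*pi*i*z*x) on [0, 1],
    z = -max_freq..max_freq, by a Riemann sum."""
    x = np.linspace(0.0, 1.0, grid_size, endpoint=False)
    fx = f(x)
    zs = np.arange(-max_freq, max_freq + 1)
    coeffs = np.array([np.mean(fx * np.exp(-2j * np.pi * z * x)) for z in zs])
    return zs, coeffs

def sobolev_norm(zs, coeffs, s, p=2):
    """Generalized-ellipse norm (sum_z a_z^p |f_z|^p)^(1/p) with a_z = max(|z|, 1)^s,
    i.e. (up to constants) the W^{s,p} norm of Section 2."""
    a = np.maximum(np.abs(zs), 1.0) ** s
    return (np.sum((a * np.abs(coeffs)) ** p)) ** (1.0 / p)

smooth = lambda x: 1.0 + 0.5 * np.cos(2 * np.pi * x)                # one Fourier mode
rough = lambda x: 1.0 + 0.5 * np.sign(np.cos(2 * np.pi * x))        # discontinuous

for name, f in [("smooth", smooth), ("rough", rough)]:
    zs, c = fourier_coefficients(f, max_freq=64)
    print(name, [round(float(sobolev_norm(zs, c, s)), 2) for s in (0.0, 1.0, 2.0)])
# The norm of the rough function grows quickly with s (its coefficients decay like 1/|z|),
# while the smooth function stays bounded, i.e. it lies in W^{s,2}(L) for every s.
```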
## 3 Related Work

Our results apply directly to many of the losses that have been used in GANs, including 1-Wasserstein distance [3, 19], MMD [25], Sobolev distances [30], and the Dudley metric [1]. As discussed in the Appendix, slightly different assumptions are required to obtain results for the Jensen-Shannon divergence (used in the original GAN formulation of [17]) and other f-divergences [33]. Given their generality, our results relate to many prior works on distribution estimation, including classical work in nonparametric statistics and empirical process theory, as well as more recent work studying Wasserstein distances and MMD. Here, we briefly survey known results for these problems. There have also been a few other statistical analyses of the GAN framework; due to space constraints, we discuss these works in the Appendix.

$L^2_\lambda$ distances: Classical work in nonparametric statistics has typically focused on the problem of smooth density estimation under $L^2_\lambda$ loss, corresponding to the adversarial loss $d_{\mathcal{F}_D}$ with $\mathcal{F}_D = L^2_\lambda(L_D)$ (the Hölder dual of $L^2_\lambda$) [49, 47]. In this case, when $\mathcal{F}_G = \mathcal{W}^{t,2}(L_G)$ is a Sobolev class, the minimax rate is typically $M(\mathcal{F}_D, \mathcal{F}_G) \asymp n^{-\frac{t}{2t+d}}$, matching the rates given by our main results.

Maximum Mean Discrepancy (MMD): When $\mathcal{F}_D$ is a reproducing kernel Hilbert space (RKHS), the adversarial loss $d_{\mathcal{F}_D}$ has been widely studied under the name maximum mean discrepancy (MMD) [18, 46]. When the RKHS kernel is translation-invariant, one can express $\mathcal{F}_D$ in the form $\mathcal{H}_{2,a}$, where $a$ is determined by the spectrum of the kernel, and so our analysis holds for MMD losses with translation-invariant kernels (see Example 6). To the best of our knowledge, minimax rates for density estimation under MMD loss have not been established in general; our analysis suggests that density estimation under an MMD loss is essentially equivalent to the problem of estimating kernel mean embeddings studied in [46], as both amount to density estimation while ignoring bias, and both typically have a parametric $n^{-1/2}$ minimax rate. Note that the related problems of estimating MMD itself, and of using it in statistical tests for homogeneity and dependence, have received extensive theoretical treatment [18, 35].

Wasserstein Distances: When $\mathcal{F}_D = \mathcal{W}^{1,\infty}(L)$ is the class of 1-Lipschitz functions, $d_{\mathcal{F}_D}$ is equivalent to the (order-1) Wasserstein (also called earth-mover's or Kantorovich-Rubinstein) distance. In this case, when $\mathcal{F}_G$ contains all Borel measurable distributions on $\mathcal{X}$, minimax bounds have been established under very general conditions (essentially, when the sample space $\mathcal{X}$ is an arbitrary totally bounded metric space) in terms of covering numbers of $\mathcal{X}$ [50, 39, 23]. In the particular case that $\mathcal{X}$ is a bounded subset of $\mathbb{R}^d$ of full dimension (i.e., having non-empty interior, comparable to the case $\mathcal{X} = [0, 1]^d$ that we study here), these results imply a minimax rate of $M(\mathcal{F}_D, \mathcal{F}_G) \asymp n^{-\min\{\frac{1}{d}, \frac{1}{2}\}}$, matching our rates. Notably, these upper bounds are derived using the empirical distribution, which cannot benefit from smoothness of the true distribution (see [50]). At the same time, it is not obvious how to generalize smoothing estimators to sample spaces that are not sufficiently nice subsets of $\mathbb{R}^d$.

Sobolev IPMs: The closest work to the present one is [26], which we believe was the first to analyze how convergence rates jointly depend on (Sobolev) smoothness restrictions on both $\mathcal{F}_D$ and $\mathcal{F}_G$. Specifically, for Sobolev spaces $\mathcal{F}_D = \mathcal{W}^{s,p}$ and $\mathcal{F}_G = \mathcal{W}^{t,q}$ with $p, q \geq 2$ (compare our Example 4), they showed

$$n^{-\frac{s+t}{2t+d}} \lesssim M\left(\mathcal{W}^{s,2}, \mathcal{W}^{t,2}\right) \lesssim n^{-\frac{s+t}{2(s+t)+d}}. \tag{2}$$
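As a quick numerical aside (not part of the original paper), the following snippet compares the two exponents appearing in (2) for a few choices of $(s, t, d)$, making the polynomial-in-$n$ gap explicit:

```python
# Quick check (illustrative only) of how far apart the exponents in (2) are.
def lower_exp(s, t, d):   # exponent in the lower bound n^{-(s+t)/(2t+d)}
    return (s + t) / (2 * t + d)

def upper_exp(s, t, d):   # exponent in the upper bound n^{-(s+t)/(2(s+t)+d)}
    return (s + t) / (2 * (s + t) + d)

for s, t, d in [(1, 1, 2), (1, 2, 3), (2, 1, 10)]:
    print(s, t, d, round(lower_exp(s, t, d), 3), round(upper_exp(s, t, d), 3))
# e.g. for s = t = 1, d = 2 the two bounds are n^{-0.5} versus n^{-0.333...},
# a gap that is polynomial in n; the results below close it at the faster rate.
```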
Our main results in Sections 4 and 5 improve on this in two main ways. First, our results generalize to, and are tight for, many spaces besides Sobolev spaces. Examples include when $\mathcal{F}_D$ is a reproducing kernel Hilbert space (RKHS) with translation-invariant kernel, or when $\mathcal{F}_G$ is the class of all Borel probability measures. Our bounds also allow other (e.g., wavelet) estimators, whereas the bounds of [26] are for the (uniformly $L^\infty_\lambda$-bounded) Fourier basis. Second, the lower and upper bounds in (2) diverge by a factor polynomial in $n$. We tighten the upper bound to match the lower bound, identifying, for the first time, minimax rates for many problems of this form (e.g., $M(\mathcal{W}^{s,2}, \mathcal{W}^{t,2}) \asymp n^{-\frac{s+t}{2t+d}}$ in the Sobolev case above). Our analysis has several interesting implications:

1. When $s > d/2$, the convergence becomes parametric: $M(\mathcal{W}^{s,2}, \mathcal{F}_G) \asymp n^{-1/2}$, for any class of distributions $\mathcal{F}_G$. This highlights that the loss $d_{\mathcal{F}_D}$ is quite weak for large $s$, and matches known minimax results for the Wasserstein case $s = 1$ [10, 39].

2. Our upper bounds, as in [26], are for smoothing estimators (namely, the orthogonal series estimator (3)). In contrast, previous analyses of Wasserstein loss focused on convergence of the (unsmoothed) empirical distribution $\hat{P}_E$ to the true distribution, which typically occurs at a rate of $n^{-1/d} + n^{-1/2}$, where $d$ is the intrinsic dimension of the support of $P$ [10, 50, 39]. Moreover, if $\mathcal{F}_G$ includes all Borel probability measures, this rate is minimax optimal [39]. The loose upper bound of [26] left open the questions of whether (when $s < d/2$) a very small amount ($t \in \left(0, \frac{2s^2}{d - 2s}\right]$) of smoothness improves the minimax rate and, more importantly, whether smoothed estimators are outperformed by $\hat{P}_E$ in this regime. Our results imply that, for $s < d/2$, the minimax rate strictly improves with smoothness $t$, and that, as long as the support of $P$ has full dimension, the smoothed estimator always converges faster than $\hat{P}_E$. An important open problem is to simultaneously leverage the cases when $P$ is smooth and has support of low intrinsic dimension; many data (e.g., images) likely enjoy both these properties.

3. [26] suggested over-smoothing the estimate (the smoothing parameter $\zeta$ discussed in Equation (3) below was set to $\zeta \asymp n^{\frac{1}{2(s+t)+d}}$) compared to the case of $L^2_\lambda$ loss, and hence it was not clear how to design estimators that adapt to unknown smoothness under losses $d_{\mathcal{W}^{s,p}}$. We show that the optimal smoothing ($\zeta \asymp n^{\frac{1}{2t+d}}$) under $d_{\mathcal{W}^{s,p}}$ loss is identical to that under $L^2_\lambda$ loss, and we use this to design an adaptive estimator (see Corollary 5).

4. Our bounds imply improved performance bounds for optimized GANs, discussed in Section 7.

## 4 Upper Bounds for Orthogonal Series Estimators

This section gives upper bounds on the adversarial risk of the following density estimator. For any finite set $Z \subseteq \mathcal{Z}$, let $\hat{P}_Z$ be the truncated series estimate

$$\hat{P}_Z := \sum_{z \in Z} \hat{P}_z \phi_z, \quad \text{where, for any } z \in Z, \quad \hat{P}_z := \frac{1}{n} \sum_{i=1}^{n} \phi_z(X_i). \tag{3}$$

$Z$ is a tuning parameter that typically corresponds to a smoothing parameter; for example, when $\mathcal{B}$ is the Fourier basis and $Z = \{z \in \mathbb{Z}^d : \|z\| \leq \zeta\}$ for some $\zeta > 0$, $\hat{P}_Z$ is equivalent to a kernel density estimator using a sinc product kernel $K_h(x) = \prod_{j=1}^{d} \frac{2}{h} \frac{\sin(2\pi x_j / h)}{2\pi x_j / h}$ with bandwidth $h = 1/\zeta$ [34]. We now present our main upper bound on the minimax rate of density estimation under adversarial losses. The upper bound is given by the orthogonal series estimator in Equation (3), but we expect kernel and other standard linear density estimators to converge at the same rate.
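Before stating the bound, here is a minimal one-dimensional sketch (an illustrative assumption of ours, not the authors' code) of the truncated series estimator (3) with the Fourier basis $\phi_z(x) = e^{2\pi i z x}$ on $[0, 1]$; the sampling distribution and the constant in the cutoff are arbitrary choices.

```python
import numpy as np

def series_density_estimate(samples, zeta, grid):
    """Truncated Fourier series estimator (3) on [0, 1]:
    hat P_Z = sum_{|z| <= zeta} hat P_z phi_z, with hat P_z = (1/n) sum_i phi_z(X_i)
    and phi_z(x) = exp(2*pi*i*z*x).  `zeta` plays the role of the smoothing parameter."""
    zs = np.arange(-zeta, zeta + 1)
    # Empirical coefficients hat P_z (conjugated exponential, matching <f, phi_z>).
    coeffs = np.array([np.mean(np.exp(-2j * np.pi * z * samples)) for z in zs])
    # Evaluate the estimate on a grid of points in [0, 1].
    basis = np.exp(2j * np.pi * np.outer(grid, zs))          # shape (|grid|, |Z|)
    return np.real(basis @ coeffs)

rng = np.random.default_rng(0)
n = 2000
samples = rng.beta(2.0, 5.0, size=n)             # unknown density P on [0, 1]
grid = np.linspace(0.0, 1.0, 200, endpoint=False)

# The rate-optimal cutoff discussed above is zeta ~ n^{1/(2t+d)}; here d = 1 and we
# guess t = 2, giving zeta ~ n^{1/5}.  The constant is an arbitrary choice.
zeta = int(np.ceil(n ** (1.0 / 5.0)))
estimate = series_density_estimate(samples, zeta, grid)
print(zeta, float(np.min(estimate)), float(np.max(estimate)))
# Note: the raw series estimate can dip below zero, a well-known feature of
# orthogonal series density estimators.
```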
Theorem 1 (Upper Bound). Suppose that $\mu(\mathcal{X}) < \infty$ and there exist constants $L_D, L_G > 0$ and real-valued nets $\{a_z\}_{z \in \mathcal{Z}}$, $\{b_z\}_{z \in \mathcal{Z}}$ such that $\mathcal{F}_D = \mathcal{H}_{p,a}(L_D)$ and $\mathcal{F}_G = \mathcal{H}_{q,b}(L_G)$, where $p, q \geq 1$. Let $p' = \frac{p}{p-1}$ denote the Hölder conjugate of $p$. Then, for any $P \in \mathcal{F}_G$,

$$\mathbb{E}_{X_{1:n}}\left[ d_{\mathcal{F}_D}\left(P, \hat{P}_Z\right) \right] \leq L_D \left( \frac{c_{p'}}{\sqrt{n}} \left( \sum_{z \in Z} \frac{\|\phi_z\|_{L^2_P}^{p'}}{a_z^{p'}} \right)^{1/p'} + L_G \left( \sum_{z \in \mathcal{Z} \setminus Z} (a_z b_z)^{-r} \right)^{1/r} \right), \tag{4}$$

where $c_{p'} > 0$ is a constant depending only on $p'$, $\frac{1}{r} = 1 - \frac{1}{p} - \frac{1}{q}$, and sums with infinite exponent are interpreted as suprema.

The two terms in the bound (4) demonstrate a bias-variance tradeoff, in which the first term (variance) increases with the truncation set $Z$ and is typically independent of the class $\mathcal{F}_G$ of distributions, while the second term (bias) decreases with $Z$ at a rate depending on the complexity of $\mathcal{F}_G$.

Corollary 2 (Sufficient Conditions for Parametric Rate). Consider the setting of Theorem 1. If $\sum_{z \in \mathcal{Z}} \frac{\|\phi_z\|^2_{L^2_P}}{a_z^2} < \infty$ and $\max\{a_z, b_z\} \to \infty$ whenever $\|z\| \to \infty$, then the minimax rate is parametric; specifically,

$$M(\mathcal{F}_D, \mathcal{F}_G) \lesssim \frac{L_D}{\sqrt{n}} \sqrt{\sum_{z \in \mathcal{Z}} \frac{\|\phi_z\|^2_{L^2_P}}{a_z^2}}.$$

In particular, letting $c_z := \sup_{x \in \mathcal{X}} |\phi_z(x)|$ for each $z \in \mathcal{Z}$, this occurs whenever $\sum_{z \in \mathcal{Z}} \frac{c_z^2}{a_z^2} < \infty$.

In many contexts (e.g., if $P \ll \lambda$ and $\lambda \ll P$), the simpler condition $\sum_{z \in \mathcal{Z}} \frac{c_z^2}{a_z^2} < \infty$ suffices. The first, and slightly weaker, condition in terms of $\|\phi_z\|^2_{L^2_P}$ is useful when we restrict $\mathcal{F}_G$; e.g., if $\mathcal{B}$ is the wavelet basis (defined in the Appendix) and $\mathcal{F}_G$ contains only discrete distributions supported on at most $k$ points, then $\|\phi_{i,j}\|^2_{L^2_P} = 0$ for all but $k$ values of $j \in [2^i]$, at each resolution $i \in \mathbb{N}$. The assumption $\max\left\{\lim_{\|z\| \to \infty} a_z, \lim_{\|z\| \to \infty} b_z\right\} = \infty$ is quite mild; for example, the Riemann-Lebesgue lemma and the assumption that $\mathcal{F}_D$ is bounded in $L^\infty_\lambda \subseteq L^1_\lambda$ together imply that this condition always holds if $\mathcal{B}$ is the Fourier basis.

## 5 Minimax Lower Bound

In this section, we lower bound the minimax risk $M(\mathcal{F}_D, \mathcal{F}_G)$ of distribution estimation under $d_{\mathcal{F}_D}$ loss over $\mathcal{F}_G$, for the case when $\mathcal{F}_D = \mathcal{H}_{p,a}$ and $\mathcal{F}_G := \mathcal{H}_{q,b}$ are generalized ellipses. As we show in some examples in Section 6, our lower bound rate matches our upper bound rate in Theorem 1 for many spaces $\mathcal{F}_D$ and $\mathcal{F}_G$ of interest. Our lower bound also suggests that the assumptions in Corollary 2 are typically necessary to guarantee the parametric convergence rate $n^{-1/2}$.

Theorem 3 (Minimax Lower Bound). Fix $\mathcal{X} = [0, 1]^d$, and let $p_0$ denote the uniform density (with respect to Lebesgue measure) on $\mathcal{X}$. Suppose $\{p_0\} \cup \{\phi_z\}_{z \in \mathcal{Z}}$ is an orthonormal basis in $L^2_\mu$, and $\{a_z\}_{z \in \mathcal{Z}}$ and $\{b_z\}_{z \in \mathcal{Z}}$ are two real-valued nets. Let $L_D, L_G \geq 0$ and $p, q \geq 2$. For any $Z \subseteq \mathcal{Z}$, let

$$A_Z := |Z|^{1/2} \sup_{z \in Z} a_z, \qquad B_Z := |Z|^{1/2} \sup_{z \in Z} b_z, \qquad C_Z := \sup_{z \in Z} \|\phi_z\|_{L^\infty_\mu}.$$

Then, for $\mathcal{F}_D = \mathcal{H}_{p,a}(L_D)$ and $\mathcal{F}_G := \mathcal{H}_{q,b}(L_G)$, for any $Z \subseteq \mathcal{Z}$ satisfying

$$\frac{n L_G^2}{B_Z^2} \leq \frac{\log 2}{16} \qquad \text{and} \qquad \frac{2 L_G |Z| C_Z}{B_Z} \leq 1, \tag{5}$$

we have

$$M(\mathcal{F}_D, \mathcal{F}_G) \geq \frac{L_G L_D |Z|}{64\, A_Z B_Z} = \frac{L_G L_D}{64 \left( \sup_{z \in Z} a_z \right) \left( \sup_{z \in Z} b_z \right)}.$$

As in most minimax lower bounds, our proof relies on constructing a finite set $\Omega_G$ of worst-case densities in $\mathcal{F}_G$, lower bounding the distance $d_{\mathcal{F}_D}$ over $\Omega_G$, and then letting the elements of $\Omega_G$ shrink towards the uniform distribution $p_0$ at a rate such that the average information (here, Kullback-Leibler) divergence between each $p \in \Omega_G$ and $p_0$ does not grow with $n$. The first condition in (5) ensures that the information divergence between each $p \in \Omega_G$ and $p_0$ is sufficiently small, and typically results in a tuning of $Z$ identical (in rate) to its optimal tuning in the upper bound (Theorem 1). The second condition in (5) is needed to ensure that the worst-case densities we construct are everywhere non-negative. Hence, this condition is not needed for lower bounds in the Gaussian sequence model, as in Theorem 2.3 of [26].
However, failure of this condition (asymptotically) corresponds to the breakdown point of the asymptotic equivalence between the Gaussian sequence model and the density estimation model in the regime of very low smoothness (e.g., in the Sobolev setting, when $t < d/2$; see [9]), and so finer analysis is needed to establish lower bounds here.

## 6 Examples

In this section, we apply our bounds from Sections 4 and 5 to compute concrete minimax convergence rates for two example choices of $\mathcal{F}_D$ and $\mathcal{F}_G$, namely Sobolev spaces and reproducing kernel Hilbert spaces. Due to space constraints, we consider only the Fourier basis here, but, in the Appendix, we also discuss an estimator in the Sobolev case using the Haar wavelet basis. For the purpose of this section, suppose that $\mathcal{X} = [0, 2\pi]^d$, $\mathcal{Z} = \mathbb{Z}^d$, and, for each $z \in \mathcal{Z}$, $\phi_z$ is the $z$th standard Fourier basis element, given by $\phi_z(x) = e^{i \langle z, x \rangle}$ for all $x \in \mathcal{X}$. In this case, we will always choose the truncation set $Z$ to be of the form $Z := \{z \in \mathcal{Z} : \|z\| \leq \zeta\}$, for some $\zeta > 0$, so that $|Z| \asymp \zeta^d$. Moreover, for every $z \in \mathcal{Z}$, $\|\phi_z\|_{L^\infty_\mu} = 1$, and hence $C_Z = 1$.

Example 4 (Sobolev Spaces). Suppose that, for some $s, t \geq 0$, $a_z = \|z\|^s$ and $b_z = \|z\|^t$. Then, setting $\zeta \asymp n^{\frac{1}{2t+d}}$ in Theorems 1 and 3 gives that there exist constants $C > c > 0$ such that

$$c\, n^{-\min\left\{\frac{s+t}{2t+d}, \frac{1}{2}\right\}} \leq M\left(\mathcal{W}^{s,2}, \mathcal{W}^{t,2}\right) \leq C\, n^{-\min\left\{\frac{s+t}{2t+d}, \frac{1}{2}\right\}}. \tag{6}$$

Combining the observation that the $s$-Hölder space $\mathcal{W}^{s,\infty} \subseteq \mathcal{W}^{s,2}$ with the lower bound (over $\mathcal{W}^{s,\infty}$) in Theorem 3.1 of [26], we have that (6) also holds when $\mathcal{W}^{s,2}$ is replaced with $\mathcal{W}^{s,p}$ for any $p \in [2, \infty]$ (e.g., in the case of the Wasserstein metric $d_{\mathcal{W}^{1,\infty}}$).

So far, we have assumed the smoothness $t$ of the true distribution $P$ is known, and used it to tune the parameter $\zeta$ of the estimator. However, in reality, $t$ is not known. In the next result, we leverage the fact that the rate-optimal choice $\zeta \asymp n^{\frac{1}{2t+d}}$ above does not rely on the loss parameter $s$, together with Theorem 1, to construct an adaptively minimax estimator, i.e., one that is minimax and fully data-dependent. There is a large literature on adaptive nonparametric density estimation under $L^2_\mu$ loss; see [14] for an accessible high-level discussion and [16] for a technical but comprehensive review.

Corollary 5 (Adaptive Upper Bound for Sobolev Spaces). There exists an adaptive choice $\hat{\zeta} : \mathcal{X}^n \to \mathbb{N}$ of the hyperparameter $\zeta$ (independent of $s$ and $t$) such that, for any $s, t \geq 0$, there exists a constant $C > 0$ (independent of $n$) with

$$\sup_{P \in \mathcal{W}^{t,2}} \mathbb{E}_{X_{1:n} \overset{IID}{\sim} P}\left[ d_{\mathcal{W}^{s,2}}\left(P, \hat{P}_{Z_{\hat{\zeta}(X_{1:n})}}\right) \right] \leq C\, M\left(\mathcal{W}^{s,2}, \mathcal{W}^{t,2}\right). \tag{7}$$

Due to space constraints, we present the actual construction of the adaptive $\hat{\zeta}$ in the Appendix, but, in brief, it is a standard construction based on leave-one-out cross-validation under $L^2_\mu$ loss, which is known (e.g., see Sections 7.2.1 and 7.5.2 of [28]) to be adaptively minimax under $L^2_\mu$ loss. Using the fact that our upper bound in Theorem 1 uses a choice of $\zeta$ that is independent of the loss parameter $s$, we show that the $d_{\mathcal{W}^{s,2}}$ risk of $\hat{P}_\zeta$ can be factored into its $L^2_\mu$ risk and a component ($\zeta^{-s}$) that is independent of $t$. Since the $L^2_\mu$ risk can be rate-minimized without knowledge of $t$, it follows that the $d_{\mathcal{W}^{s,2}}$ risk can also be rate-minimized without knowledge of $t$. Adaptive minimaxity then follows from Theorem 3.
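The authors' exact adaptive construction appears in their Appendix; purely as an illustrative sketch (our own simplification, not their construction), the following implements a generic leave-one-out cross-validation selector for the Fourier cutoff $\zeta$ in $d = 1$, of the kind Corollary 5 relies on. All helper names are ours.

```python
import numpy as np

def loo_cv_score(samples, zeta):
    """Leave-one-out cross-validation criterion for the L2 risk of the cutoff-zeta
    Fourier series estimate, up to an additive constant not depending on zeta."""
    n = len(samples)
    zs = np.arange(-zeta, zeta + 1)
    phi_at_x = np.exp(2j * np.pi * np.outer(zs, samples))        # phi_z(X_i), shape (|Z|, n)
    coeffs = np.conj(phi_at_x).mean(axis=1)                      # hat P_z
    integral_sq = np.sum(np.abs(coeffs) ** 2)                    # integral of |hat P_Z|^2
    loo_coeffs = (n * coeffs[:, None] - np.conj(phi_at_x)) / (n - 1)   # hat P_z without X_i
    loo_at_xi = np.real(np.sum(loo_coeffs * phi_at_x, axis=0))   # hat P_{Z,-i}(X_i)
    return integral_sq - 2.0 * loo_at_xi.mean()

rng = np.random.default_rng(0)
samples = rng.beta(2.0, 5.0, size=2000)

candidate_zetas = range(1, 21)
scores = [loo_cv_score(samples, z) for z in candidate_zetas]
zeta_hat = candidate_zetas[int(np.argmin(scores))]
print("cross-validated cutoff:", zeta_hat)
# A cutoff chosen this way never uses s or t; by the argument behind Corollary 5,
# such a data-driven choice is simultaneously near-optimal for the L2 risk and
# for the d_{W^{s,2}} risks.
```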
Example 6 (Reproducing Kernel Hilbert Space/MMD Loss). Suppose $\mathcal{H}_k$ is a reproducing kernel Hilbert space (RKHS) with reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ [4, 6]. If $k$ is translation invariant (i.e., there exists $\kappa \in L^2_\mu$ such that, for all $x, y \in \mathcal{X}$, $k(x, y) = \kappa(x - y)$), then Bochner's theorem (see, e.g., Theorem 6.6 of [51]) implies that, up to constant factors,

$$\mathcal{H}_k(L) := \{ f \in \mathcal{H}_k : \|f\|_{\mathcal{H}_k} \leq L \} = \left\{ f : \sum_{z \in \mathcal{Z}} |\tilde{\kappa}_z|^{-2} |\tilde{f}_z|^2 \leq L^2 \right\}.$$

Thus, in the setting of Theorem 1, we have $\mathcal{H}_k = \mathcal{H}_{2,a}$, where $a_z = |\tilde{\kappa}_z|^{-1}$ satisfies $\sum_{z \in \mathcal{Z}} a_z^{-2} = \|\kappa\|^2_{L^2_\mu} < \infty$. Corollary 2 then gives $M(\mathcal{H}_k(L_D), \mathcal{F}_G) \lesssim L_D \|\kappa\|_{L^2_\mu} n^{-1/2}$ for any class $\mathcal{F}_G$. It is well known that MMD can always be estimated at the parametric rate $n^{-1/2}$ [18]; however, to the best of our knowledge, only recently has it been shown that any probability distribution can be estimated at the rate $n^{-1/2}$ under MMD loss [41], emphasizing the fact that MMD is a very weak metric. This has important implications for applications such as two-sample testing [35].

## 7 Consequences for Generative Adversarial Neural Networks (GANs)

This section discusses implications of our minimax bounds for GANs. Neural networks in this section are assumed to be fully-connected, with rectified linear unit (ReLU) activations. [26] used their upper bound result (2) to prove a similar theorem, but, since their upper bound was loose, the resulting theorem was also loose. The following results are immediate consequences of our improvement (Theorem 1) over the upper bound (2) of [26], and so we refer to that paper for the proof. Key ingredients are an oracle inequality proven in [26], an upper bound such as Theorem 1, and the bounds of [52] on the size of a neural network needed to approximate functions in a Sobolev class. In the following, $\mathcal{F}_D$ denotes the set of functions that can be encoded by the discriminator network and $\mathcal{F}_G$ denotes the set of distributions that can be encoded by the generator network. $P_n := \frac{1}{n} \sum_{i=1}^{n} 1_{\{X_i\}}$ denotes the empirical distribution of the observed data $X_{1:n} \overset{IID}{\sim} P$.

Theorem 7 (Improvement of Theorem 3.1 in Liang [26]). Let $s, t > 0$, and fix a desired approximation accuracy $\epsilon > 0$. Then, there exists a GAN architecture in which

1. the discriminator $\mathcal{F}_D$ has at most $O(\log(1/\epsilon))$ layers and $O(\epsilon^{-d/s} \log(1/\epsilon))$ parameters,

2. and the generator $\mathcal{F}_G$ has at most $O(\log(1/\epsilon))$ layers and $O(\epsilon^{-d/t} \log(1/\epsilon))$ parameters,

such that, if

$$\hat{P}^*(X_{1:n}) := \operatorname{argmin}_{\hat{P} \in \mathcal{F}_G} d_{\mathcal{F}_D}\left(P_n, \hat{P}\right)$$

is the optimized GAN estimate of $P$, then

$$\sup_{P \in \mathcal{W}^{t,2}} \mathbb{E}_{X_{1:n}}\left[ d_{\mathcal{W}^{s,2}}\left(P, \hat{P}^*(X_{1:n})\right) \right] \leq C \left( \epsilon + n^{-\min\left\{\frac{s+t}{2t+d}, \frac{1}{2}\right\}} \right),$$

for some constant $C$ not depending on $n$ or $\epsilon$.

The discriminator and generator in the above theorem can be implemented as described in [52]. The assumption that the GAN is perfectly optimized may be strong; see [32, 27] for discussion of this. Though we do not present this result due to space constraints, we can similarly improve the upper bound of [26] (their Theorem 3.2) for very deep neural networks, further improving on the previous state-of-the-art bounds of [2] (which did not leverage smoothness assumptions on $P$).
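To make the "perfectly optimized GAN" estimate $\hat{P}^*(X_{1:n}) = \operatorname{argmin}_{\hat{P} \in \mathcal{F}_G} d_{\mathcal{F}_D}(P_n, \hat{P})$ concrete, here is a toy sketch (not the paper's setup): a small fixed feature family stands in for the ReLU discriminator network, and a one-parameter family of samplers stands in for the generator network; all names and parameter values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
data = rng.normal(0.35, 0.08, size=n) % 1.0      # observed samples on [0, 1]

# Finite "discriminator class": low-frequency Fourier features (a crude stand-in
# for the ReLU discriminator network of Theorem 7).
freqs = range(1, 6)
def features(x):
    return np.concatenate([[np.ones_like(x)],
                           [np.cos(2 * np.pi * k * x) for k in freqs],
                           [np.sin(2 * np.pi * k * x) for k in freqs]])

def ipm(samples_a, samples_b):
    # max_f |E_a[f] - E_b[f]| over the finite feature class above
    return np.max(np.abs(features(samples_a).mean(axis=1) - features(samples_b).mean(axis=1)))

# Tiny "generator class": wrapped Gaussians indexed by their mean.  The perfectly
# optimized GAN is the minimum-adversarial-distance fit to the empirical P_n.
def generator_samples(mean, m=5000):
    return rng.normal(mean, 0.08, size=m) % 1.0

candidate_means = np.linspace(0.0, 1.0, 41)
losses = [ipm(data, generator_samples(mu)) for mu in candidate_means]
best_mean = candidate_means[int(np.argmin(losses))]
print("minimum-adversarial-distance generator mean:", round(float(best_mean), 3))  # close to 0.35
```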
## 8 Minimax Comparison of Explicit and Implicit Generative Models

In this section, we draw formal connections between our work on density estimation (explicit generative modeling) and the problem of implicit generative modeling under an appropriate measure of risk. In the sequel, we fix a class $\mathcal{F}_G$ of probability measures on a sample space $\mathcal{X}$ and a loss function $\ell : \mathcal{F}_G \times \mathcal{F}_G \to [0, \infty]$ measuring the distance of an estimate $\hat{P}$ from the true distribution $P$. $\ell$ need not be an adversarial loss $d_{\mathcal{F}_D}$, but our discussion does apply to all $\ell$ of this form.

### 8.1 A Minimax Framework for Implicit Generative Models

Thus far, we have analyzed the minimax risk of density estimation, namely

$$M_D(\mathcal{F}_G, \ell, n) = \inf_{\hat{P}} \sup_{P \in \mathcal{F}_G} R_D(P, \hat{P}), \quad \text{where} \quad R_D(P, \hat{P}) = \mathbb{E}_{X_{1:n} \overset{IID}{\sim} P}\left[ \ell\left(P, \hat{P}(X_{1:n})\right) \right] \tag{8}$$

denotes the density estimation risk of $\hat{P}$ at $P$ and the infimum is taken over all estimators (i.e., (potentially randomized) functions $\hat{P} : \mathcal{X}^n \to \mathcal{F}_G$). Whereas density estimation is a classical statistical problem to which we have already contributed novel results, our motivations for studying this problem arose from a desire to better understand recent work on implicit generative modeling. Implicit generative models, such as GANs [3, 17] and VAEs [21, 37], address the problem of sampling, in which we seek to construct a generator that produces novel samples from the distribution $P$ [29]. In our context, a generator is a function $\hat{X} : \mathcal{X}^n \times \mathcal{Z} \to \mathcal{X}$ that takes in $n$ IID samples $X_{1:n} \sim P$ and a source of randomness (a.k.a. latent variable) $Z \sim Q_Z$ with known distribution $Q_Z$ (independent of $X_{1:n}$) on a space $\mathcal{Z}$, and returns a novel sample $\hat{X}(X_{1:n}, Z) \in \mathcal{X}$.

Evaluating the performance of implicit generative models, both in theory and in practice, is difficult, with solutions continuing to be proposed [44], some of which have proven controversial. Some of this controversy stems from the fact that many of the most straightforward evaluation objectives are optimized by a trivial generator that memorizes the training data (e.g., $\hat{X}(X_{1:n}, Z) = X_Z$, where $Z$ is uniformly distributed on $[n]$). One objective that can avoid this problem is as follows. For simplicity, fix the distribution $Q_Z$ of the latent random variable $Z \sim Q_Z$ (e.g., $Q_Z = \mathcal{N}(0, I)$). For a fixed training set $X_{1:n} \overset{IID}{\sim} P$ and latent variable $Z \sim Q_Z$, we define the implicit distribution of a generator $\hat{X}$ as the conditional distribution $P_{\hat{X}(X_{1:n}, Z) \mid X_{1:n}}$ over $\mathcal{X}$ of the random variable $\hat{X}(X_{1:n}, Z)$ given the training data. Then, for any $P \in \mathcal{F}_G$, we define the implicit risk of $\hat{X}$ at $P$ by

$$R_I(P, \hat{X}) := \mathbb{E}_{X_{1:n} \sim P}\left[ \ell\left(P, P_{\hat{X}(X_{1:n}, Z) \mid X_{1:n}}\right) \right].$$

We can then study the minimax risk of sampling,

$$M_I(\mathcal{F}_G, \ell, n) := \inf_{\hat{X}} \sup_{P \in \mathcal{F}_G} R_I(P, \hat{X}).$$

A few remarks about $M_I(\mathcal{F}_G, \ell, n)$: First, we implicitly assumed $\ell(P, P_{\hat{X}(X_{1:n}, Z) \mid X_{1:n}})$ is well-defined, which is not obvious unless $P_{\hat{X}(X_{1:n}, Z) \mid X_{1:n}} \in \mathcal{F}_G$. We discuss this assumption further below. Second, since the risk $R_I(P, \hat{X})$ depends on the unknown true distribution $P$, we cannot calculate it in practice. Third, for the same reason (because $R_I(P, \hat{X})$ depends directly on $P$ rather than on particular data $X_{1:n}$), it can detect lack-of-diversity issues such as mode collapse. As we discuss in the Appendix, these latter two points are distinctions from the recent work of [5] on generalization in GANs.
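The following sketch (ours, not the authors') contrasts the trivial memorizing generator $\hat{X}(X_{1:n}, Z) = X_Z$ with a smoothed sampler, and evaluates both against fresh draws from $P$, in the spirit of the implicit risk $R_I$; a plug-in RBF-kernel MMD statistic is used as a convenient stand-in for $\ell$, and all helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
train = rng.beta(2.0, 5.0, size=n)               # training samples X_{1:n} from P

def memorizing_generator(train, z):
    # The trivial generator hat X(X_{1:n}, Z) = X_Z with Z uniform on [n]:
    # its implicit distribution given X_{1:n} is the empirical distribution.
    return train[z]

def smoothed_generator(train, z, noise):
    # A sampler whose implicit distribution is a kernel-smoothed version of the data
    # (wrapping to stay in [0, 1]).
    return (train[z] + noise) % 1.0

z = rng.integers(0, n, size=2000)
noise = rng.normal(0.0, 0.05, size=2000)
fake_memorized = memorizing_generator(train, z)
fake_smoothed = smoothed_generator(train, z, noise)

# Under the implicit risk R_I, generators are compared to P itself (approximated
# here by a large independent sample), not to the training set, so memorization
# earns no automatic advantage.
fresh = rng.beta(2.0, 5.0, size=2000)
def mmd_rbf(a, b, bw=0.1):
    # Biased (V-statistic) plug-in estimate of squared MMD with an RBF kernel.
    k = lambda u, v: np.exp(-np.subtract.outer(u, v) ** 2 / (2 * bw ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

print(round(float(mmd_rbf(fresh, fake_memorized)), 5),
      round(float(mmd_rbf(fresh, fake_smoothed)), 5))
```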
### 8.2 Comparison of Explicit and Implicit Generative Models

Algorithmically, sampling is a very distinct problem from density estimation; for example, many computationally efficient Monte Carlo samplers rely on the fact that a function proportional to the density of interest can be computed much more quickly than the exact (normalized) density function [11]. In this section, we show that, given unlimited computational resources, the problems of density estimation and sampling are equivalent in a minimax statistical sense. Since exactly minimax estimators ($\operatorname{argmin}_{\hat{P}} \sup_{P \in \mathcal{F}_G} R_D(P, \hat{P})$) often need not exist, the following weaker notion is useful for stating our results:

Definition 8 (Nearly Minimax Sequence). A sequence $\{\hat{P}_k\}_{k \in \mathbb{N}}$ of density estimators (resp., $\{\hat{X}_k\}_{k \in \mathbb{N}}$ of generators) is called nearly minimax over $\mathcal{F}_G$ if $\lim_{k \to \infty} \sup_{P \in \mathcal{F}_G} R_D(P, \hat{P}_k) = M_D(\mathcal{F}_G, \ell, n)$ (resp., $\lim_{k \to \infty} \sup_{P \in \mathcal{F}_G} R_I(P, \hat{X}_k) = M_I(\mathcal{F}_G, \ell, n)$).

The following theorem identifies sufficient conditions under which, in the statistical minimax framework described above, density estimation is no harder than sampling. The idea behind the proof is as follows: If we have a good sampler $\hat{X}$ (i.e., with $R_I(P, \hat{X})$ small), then we can draw $m$ fake samples from $\hat{X}$. We can use these fake samples to construct a density estimate $\hat{P}$ of the implicit distribution of $\hat{X}$ such that, under the technical assumptions below, $R_D(P, \hat{P}) - R_I(P, \hat{X}) \to 0$ as $m \to \infty$.

Theorem 9 (Conditions under which Density Estimation is Statistically no Harder than Sampling). Let $\mathcal{F}_G$ be a family of probability distributions on a sample space $\mathcal{X}$. Suppose

(A1) $\ell : \mathcal{F}_G \times \mathcal{F}_G \to [0, \infty]$ is non-negative, and there exists $C > 0$ such that, for all $P_1, P_2, P_3 \in \mathcal{F}_G$, $\ell(P_1, P_3) \leq C \left( \ell(P_1, P_2) + \ell(P_2, P_3) \right)$.

(A2) $M_D(\mathcal{F}_G, \ell, m) \to 0$ as $m \to \infty$.

(A3) For all $m \in \mathbb{N}$, we can draw $m$ IID samples $Z_1, \dots, Z_m \overset{IID}{\sim} Q_Z$ of the latent variable $Z$.

(A4) There exists a nearly minimax sequence of samplers $\hat{X}_k : \mathcal{X}^n \times \mathcal{Z} \to \mathcal{X}$ such that, for each $k \in \mathbb{N}$, almost surely over $X_{1:n}$, $P_{\hat{X}_k(X_{1:n}, Z) \mid X_{1:n}} \in \mathcal{F}_G$.

Then, $M_D(\mathcal{F}_G, \ell, n) \leq C\, M_I(\mathcal{F}_G, \ell, n)$.

Assumption (A1) is a generalization of the triangle inequality (and reduces to the triangle inequality when $C = 1$). This weaker assumption applies, for example, when $\ell$ is the Jensen-Shannon divergence (with $C = 2$) used in the original GAN formulation of [17], even though this divergence does not satisfy the triangle inequality [15]. Assumption (A2) is equivalent to the existence of a uniformly $\ell$-risk-consistent estimator over $\mathcal{F}_G$, a standard property of most distribution classes $\mathcal{F}_G$ over which density estimation is studied (e.g., our Theorem 1). Assumption (A3) is a natural design criterion of implicit generative models; usually, $Q_Z$ is a simple parametric distribution such as a standard normal. Finally, Assumption (A4) is the most mysterious, because, currently, little is known about the minimax theory of samplers when $\mathcal{F}_G$ is a large space. On one hand, since $M_I(\mathcal{F}_G, \ell, n)$ is an infimum over $\hat{X}$, Theorem 9 continues to hold if we restrict the class of samplers (e.g., to those satisfying Assumption (A4) or those we can compute). On the other hand, even without restricting $\hat{X}$, this assumption may not be too restrictive, because nearly minimax samplers are necessarily close to $P \in \mathcal{F}_G$. For example, if $\mathcal{F}_G$ contains only smooth distributions but $\hat{X}$ is the trivial empirical sampler described above, then $\ell(P, P_{\hat{X}})$ should be large and $\hat{X}$ is unlikely to be minimax optimal.

Finally, in practice, we often do not know estimators that are nearly minimax for finite samples, but may have estimators that are rate-optimal (e.g., as given by Theorem 1), i.e., that satisfy

$$C' := \limsup_{n \to \infty} \frac{\sup_{P \in \mathcal{F}_G} R_I(P, \hat{X})}{M_I(\mathcal{F}_G, \ell, n)} < \infty.$$

Under this weaker assumption, it is straightforward to modify our proof to conclude that

$$\limsup_{n \to \infty} \frac{M_D(\mathcal{F}_G, \ell, n)}{M_I(\mathcal{F}_G, \ell, n)} \leq C\, C'.$$

The converse result ($M_I(\mathcal{F}_G, \ell, n) \lesssim M_D(\mathcal{F}_G, \ell, n)$) is simple to prove in many cases, and is related to the well-studied problem of Monte Carlo sampling [38]; we discuss this briefly in the Appendix.
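The reduction behind Theorem 9 can be sketched in code (a schematic of ours, not the authors' proof): draw $m$ fake samples from a given sampler, then fit the explicit series estimator of Section 4 to those fake samples; as $m$ grows, the additional estimation error this step introduces vanishes. The particular sampler and constants below are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

def sampler(train, num_samples):
    # Stand-in for a trained implicit model hat X: here, resample the training
    # data and perturb it (any generator exposing IID draws would do).
    idx = rng.integers(0, len(train), size=num_samples)
    return (train[idx] + rng.normal(0.0, 0.05, size=num_samples)) % 1.0

def series_coeffs(samples, zeta):
    # Fourier coefficients of the Section-4 series estimate fit to `samples`.
    zs = np.arange(-zeta, zeta + 1)
    return zs, np.exp(-2j * np.pi * np.outer(zs, samples)).mean(axis=1)

train = rng.beta(2.0, 5.0, size=1000)
zeta = 6
_, coeffs_small = series_coeffs(sampler(train, 2_000), zeta)
_, coeffs_large = series_coeffs(sampler(train, 200_000), zeta)
print(round(float(np.max(np.abs(coeffs_small - coeffs_large))), 4))
# The coefficient estimates stabilize as m grows, so the density estimate built
# from fake samples inherits (up to the constant C) the sampler's implicit risk.
```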
## 9 Conclusions

Given the recent popularity of implicit generative models in many applications, it is important to theoretically understand why these models appear to outperform classical methods for similar problems. This paper provided new minimax bounds for density estimation under adversarial losses, both with and without adaptivity to smoothness, and gave several applications, including both traditional statistical settings and perfectly optimized GANs. We also gave simple conditions under which minimax bounds for density estimation imply bounds for the problem of implicit generative modeling, suggesting that sampling is typically not statistically easier than density estimation. Thus, for example, the strong curse of dimensionality that is known to afflict nonparametric density estimation [49] should also limit the performance of implicit generative models such as GANs. The Appendix describes several specific avenues for further investigation, including whether the curse of dimensionality can be avoided when data lie on a low-dimensional manifold.

## Acknowledgments

This work was partly supported by NSF grant IIS1563887, the DARPA D3M program, AFRL FA8750-17-2-0212, and the NSF Graduate Research Fellowship DGE-1252522.

## References

[1] Ehsan Abbasnejad, Javen Shi, and Anton van den Hengel. Deep Lipschitz networks and Dudley GANs, 2018. URL https://openreview.net/forum?id=rkw-jlb0W.

[2] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214-223, 2017.

[4] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404, 1950.

[5] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, pages 224-232, 2017.

[6] Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.

[7] Gérard Biau and Luc Devroye. Lectures on the Nearest Neighbor Method. Springer, 2015.

[8] Leon Bottou, Martin Arjovsky, David Lopez-Paz, and Maxime Oquab. Geometrical insights for implicit generative modeling. arXiv preprint arXiv:1712.07822, 2017.

[9] Lawrence D. Brown, Cun-Hui Zhang, et al. Asymptotic nonequivalence of nonparametric experiments when the smoothness index is 1/2. The Annals of Statistics, 26(1):279-287, 1998.

[10] Guillermo Canas and Lorenzo Rosasco. Learning probability measures with respect to optimal transport metrics. In Advances in Neural Information Processing Systems, pages 2492-2500, 2012.

[11] Siddhartha Chib and Edward Greenberg. Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4):327-335, 1995.

[12] Peter J. Diggle and Richard J. Gratton. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society, Series B (Methodological), pages 193-227, 1984.

[13] R. M. Dudley. Speeds of metric probability convergence. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 22(4):323-332, 1972.

[14] Sam Efromovich. Orthogonal series density estimation. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):467-476, 2010.

[15] Dominik Maria Endres and Johannes E. Schindelin. A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858-1860, 2003.

[16] Alexander Goldenshluger and Oleg Lepski. On adaptive minimax density estimation on R^d. Probability Theory and Related Fields, 159(3-4):479-543, 2014.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

[18] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.

[19] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769-5779, 2017.

[20] Leonid Vasilevich Kantorovich and Gennady S. Rubinstein. On a space of completely additive functions. Vestnik Leningrad. Univ., 13(7):52-59, 1958.

[21] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[22] Andrey Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari, Giorn., 4:83-91, 1933.

[23] Jing Lei. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. arXiv preprint arXiv:1804.10556, 2018.

[24] Giovanni Leoni. A First Course in Sobolev Spaces. American Mathematical Society, 2017.

[25] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2200-2210, 2017.

[26] Tengyuan Liang. How well can generative adversarial networks (GAN) learn densities: A nonparametric view. arXiv preprint arXiv:1712.08244, 2017.

[27] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint arXiv:1802.06132, 2018.

[28] Pascal Massart. Concentration Inequalities and Model Selection, volume 6. Springer, 2007.

[29] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[30] Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN. arXiv preprint arXiv:1711.04894, 2017.

[31] Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997.

[32] Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5591-5600, 2017.

[33] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271-279, 2016.

[34] Mark Owen. Practical Signal Processing. Cambridge University Press, 2007.

[35] Aaditya Ramdas, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry A. Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI, pages 3571-3577, 2015.

[36] Aaditya Ramdas, Nicolás García Trillos, and Marco Cuturi. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017.

[37] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

[38] Christian P. Robert. Monte Carlo Methods. Wiley Online Library, 2004.

[39] Shashank Singh and Barnabás Póczos. Minimax distribution estimation in Wasserstein distance. arXiv preprint arXiv:1802.08855, 2018.
[40] Nickolay Smirnov. Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics, 19(2):279-281, 1948.

[41] Bharath Sriperumbudur et al. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839-1893, 2016.

[42] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet. Non-parametric estimation of integral probability metrics. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pages 1428-1432. IEEE, 2010.

[43] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Gert R. G. Lanckriet, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550-1599, 2012.

[44] D. Sutherland, H-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In ICLR, 2017. URL https://arxiv.org/abs/1611.04488.

[45] Gábor J. Székely, Maria L. Rizzo, Nail K. Bakirov, et al. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769-2794, 2007.

[46] Ilya Tolstikhin, Bharath K. Sriperumbudur, and Krikamol Muandet. Minimax estimation of kernel mean embeddings. The Journal of Machine Learning Research, 18(1):3002-3048, 2017.

[47] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009.

[48] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[49] Larry Wasserman. All of Nonparametric Statistics. Springer Science & Business Media, 2006.

[50] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.

[51] Holger Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004.

[52] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103-114, 2017.