Published as a conference paper at ICLR 2018

ON THE DISCRIMINATION-GENERALIZATION TRADEOFF IN GANS

Pengchuan Zhang, Microsoft Research, Redmond. penzhan@microsoft.com
Qiang Liu, Computer Science, Dartmouth College. qiang.liu@dartmouth.edu
Dengyong Zhou, Google. dennyzhou@google.com
Tao Xu, Computer Science, Lehigh University. tax313@lehigh.edu
Xiaodong He, Microsoft Research, Redmond. xiaohe@microsoft.com

ABSTRACT

Generative adversarial training can be generally understood as minimizing a certain moment matching loss defined by a set of discriminator functions, typically neural networks. The discriminator set should be large enough to be able to uniquely identify the true distribution (discriminative), and also be small enough to generalize beyond memorizing samples (generalizable). In this paper, we show that a discriminator set is guaranteed to be discriminative whenever its linear span is dense in the set of bounded continuous functions. This is a very mild condition satisfied even by neural networks with a single neuron. Further, we develop generalization bounds between the learned distribution and the true distribution under different evaluation metrics. When evaluated with the neural distance, our bounds show that generalization is guaranteed as long as the discriminator set is small enough, regardless of the size of the generator or hypothesis set. When evaluated with the KL divergence, our bound provides an explanation for the counter-intuitive behavior of testing likelihood in GAN training. Our analysis sheds light on understanding the practical performance of GANs.

1 INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) and their variants can be generally understood as minimizing a certain moment matching loss defined by a set of discriminator functions. Mathematically, GANs minimize the integral probability metric (IPM) (Müller, 1997), that is,

d_F(µ̂_m, ν) := sup_{f∈F} { E_{x∼µ̂_m}[f(x)] − E_{x∼ν}[f(x)] },   (1)

where µ̂_m is the empirical measure of the observed data, and F and G are the sets of discriminators and generators, respectively; the GAN objective minimizes d_F(µ̂_m, ν) over ν ∈ G. Different choices of F recover several popular GAN variants:

1. Wasserstein GAN (W-GAN) (Arjovsky et al., 2017). F = Lip_1(X) := {f : ||f||_Lip ≤ 1}, corresponding to the Wasserstein-1 distance.
2. MMD-GAN (Li et al., 2015; Dziugaite et al., 2015; Li et al., 2017a). F is taken as the unit ball in a Reproducing Kernel Hilbert Space (RKHS), corresponding to the Maximum Mean Discrepancy (MMD).
3. Energy-based GANs (Zhao et al., 2016). F is taken as the set of continuous functions bounded between 0 and M for some constant M > 0, corresponding to the total variation distance (Arjovsky et al., 2017).
4. f-GAN (Nowozin et al., 2016) minimizes the f-divergence, which can be viewed as a form of regularized moment matching loss defined over all possible functions, as shown by Liu et al. (2017). See also Appendix B.

Due to computational tractability, however, practical GANs take F to be a parametric function class, typically F_nn = {f_θ(x) : θ ∈ Θ}, where f_θ(x) is a neural network indexed by parameters θ taking values in Θ ⊆ R^p. Consequently, the resulting d_{F_nn}(µ, ν) is called the neural network distance, or neural distance (Arora et al., 2017). Although d_{F_nn}(µ, ν) is meant to be a surrogate, its properties can be fundamentally different from those of the original objective functions.
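To make the moment-matching objective in (1) concrete, the following is a minimal NumPy sketch of the plug-in estimator of d_F when F is a small, explicitly enumerated discriminator set. It is only an illustration, not an implementation from the paper; the function name `empirical_ipm`, the random grid of single-ReLU-neuron discriminators, and the toy Gaussian samples are choices made for this example.

```python
import numpy as np

def empirical_ipm(x, y, discriminators):
    """Plug-in estimate of d_F(mu_hat, nu_hat) = sup_{f in F} E_mu[f] - E_nu[f]
    for a finite discriminator set F (a list of callables). F is assumed to be
    even (f in F implies -f in F), so the supremum equals the largest |gap|."""
    gaps = [f(x).mean() - f(y).mean() for f in discriminators]
    return max(abs(g) for g in gaps)

rng = np.random.default_rng(0)

# Toy discriminator set: single ReLU neurons relu(w.x + b) with unit-norm
# parameters [w, b], discretized on a random grid for illustration.
thetas = rng.normal(size=(256, 3))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
discriminators = [(lambda x, t=t: np.maximum(x @ t[:2] + t[2], 0.0)) for t in thetas]

mu_samples = rng.normal(loc=0.0, size=(2000, 2))   # stand-in for observed data
nu_samples = rng.normal(loc=0.5, size=(2000, 2))   # stand-in for generated data
print(empirical_ipm(mu_samples, nu_samples, discriminators))  # clearly positive
print(empirical_ipm(mu_samples, mu_samples, discriminators))  # exactly zero
```

Practical GANs replace the enumeration by gradient ascent over the parameters of a neural discriminator, which is exactly the parametric objective d_{F_nn} discussed next.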
In W-GAN, for example, F_nn is a much smaller discriminator set than Lip_1(X), and it is unclear from the current GAN literature whether d_{F_nn}(µ, ν) is a discriminative metric, in the sense that d_{F_nn}(µ, ν) = 0 implies µ = ν. This discriminative property is critical to ensure the consistency of the learning result. This motivated us to study the properties of d_{F_nn}(µ, ν) with parametric function sets F_nn, instead of the original Wasserstein distance or f-divergence.

A broader question is to develop learning bounds and to study how they depend on the discriminator set F and the generator set G, under different evaluation metrics of interest. Specifically, assuming ν_m is an (approximate) solution of (1), we are interested in obtaining bounds between ν_m and the underlying true distribution µ under a given evaluation metric d_eval(µ, ν_m). Existing analyses have mostly focused on the case when the evaluation metric coincides with the optimization metric, that is, d_eval(µ, ν) = d_F(µ, ν), which, however, favors smaller discriminator sets that define easier evaluation metrics. It is of interest to develop bounds for evaluation metrics independent of F, such as the bounded Lipschitz distance that metrizes weak convergence, and the KL divergence that connects to testing likelihood.

Contribution. We show that the role of the discriminator set F is best illustrated by the conditions under which d_F(µ, ν) metrizes weak convergence (or convergence in distribution), that is,

d_F(µ, ν_m) → 0   if and only if   ν_m ⇀ µ,   (2)

for any probability measures µ and ν_m. The choice of F should strike a balance to achieve (2):

i) F should be large enough to make d_F(µ, ν) discriminative, in that d_F(µ, ν_m) → 0 implies that ν_m weakly converges to µ. Further, with a given metric d_eval(µ, ν), the discriminator set F should be large enough so that a small d_F(µ, ν) implies a small d_eval(µ, ν) in a certain sense. These are basic requirements for justifying d_F(µ, ν) as a valid learning objective function.

ii) F should also be relatively small so that ν_m ⇀ µ implies that d_F(µ, ν_m) approaches zero. This is essential to guarantee that the training and testing losses are similar to each other and hence that the algorithm is generalizable. Further, in order to obtain a low sample complexity, F should be sufficiently small so that d_F(µ, ν_m) decays at a fast rate, preferably O(1/√m).

The theme of this work is to characterize the conditions under which i) and ii) hold and to develop bounds on d_eval(µ, ν_m) that characterize the role of the discriminators F and generators G. Our contributions are summarized as follows.

1. We show that a discriminator set F is discriminative once the linear span of F is dense in the bounded continuous (or Lipschitz) function space. This is a mild condition that can be satisfied, for example, even by neural networks consisting of a single neuron. See Section 2.
2. We develop techniques that use the neural distance d_F(µ, ν) to upper bound different evaluation metrics d_eval(µ, ν) of interest, including the bounded Lipschitz (BL) distance and the KL divergence, which provides a key step for developing learning bounds of GANs under these metrics. See Section 2.1.
3. We characterize the generalizability of GANs using the Rademacher complexity of the discriminator set F and put together bounds between the true distribution µ and GAN estimators ν_m under different evaluation metrics in Section 3.
Under the neural distance, our bounds (Corollary 3.23.3) show that generalization is guaranteed as long as the discriminator set is small enough, regardless of the size of the generator or hypothesis set G. This seemingly-surprising result is reasonable because in this case the evaluation metric d F(µ, νm) depends on the discriminator set. Published as a conference paper at ICLR 2018 4. When the KL divergence is used as the evaluation metric, our bound (Corollary 3.5) suggests that the generator and discriminator sets have to be compatible in that the log density ratios of the generators and the true distributions should exist and be included inside the linear span of the discriminator set. The strong condition that log-density ratio should exist partially explains the counter-intuitive behavior of testing likelihood in flow GANs (e.g., Danihelka et al., 2017; Grover et al., 2017). 5. We extend our analysis to study neural f-divergences that are the learning objective of f-GANs, and establish similar results on the discrimination and generalization properties of neural fdivergences; see Appedix B. Different from neural distance, a neural f-divergence is discriminative if linear span of its discriminators without the output activation function is dense in the bounded continuous function space. 1.1 NOTATIONS We use X to denote a subset of Rd. For each continuous function f : X R, we define the maximum norm as f = supx X |f(x)|, and the Lipschitz norm f Lip = sup{|f(x) f(y)|/ x y : x, y X, x = y}, and the bounded Lipschitz (BL) norm f BL = max{ f Lip, f }. The set of continuous functions on X is denoted by C(X), and the Banach space of bounded continuous function is Cb(X) = {f C(X): f < }. The set of Borel probability measures on X is denoted by PB(X). In this paper, we assume that all measures involved belong to PB(X), which is sufficient in all practical applications. We denote by Eµ[f] the integral of f with respect to probability measure µ. The weak convergence, or convergence in distribution, is denoted by νn ν. Given a base measures τ (e.g., Lebesgue measure), the density of µ PB(X), if it exists, is denoted by ρµ = dµ dτ . We do not assume density exists in our main theoretical results, except the cases when discussing KL divergence. 2 DISCRIMINATIVE PROPERTIES OF NEURAL DISTANCES FOR GAN As listed in the introduction, many variants of GAN can be viewed as minimizing the integral probability metric (1). Without loss of generality, we assume that the discriminator set F is even, i.e., f F implies f F. Intuitively speaking, minimizing (1) towards zero corresponds to matching the moments Eµ[f] = Eν[f] for all discriminators f F. In their original formulation, all those discriminator sets are non-parametric, infinite dimensional, and large enough to guarantee that d F(µ, ν) = 0 implies µ = ν. In practice, however, the discriminator set is typically restricted to parametric function classes of form Fnn = {fθ : θ Θ}. When fθ is a neural network, we call d Fnn(µ, ν) a neural distance following Arora et al. (2017). Neural distances are the actual object function that W-GAN optimizes in practice because they can be practically optimized and can leverage the representation power of neural networks. Therefore, it is of great importance to directly study neural distances, instead of Wasserstein metric, in order to understand practical performance of GANs. 
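Since the neural distance is the quantity actually optimized, it is useful to see how it can be estimated from samples alone. Below is a hedged sketch of the inner maximization for a one-hidden-layer ReLU discriminator whose parameters are clipped to a box, mimicking W-GAN weight clipping so that the parameter set Θ stays bounded. The network width, learning rate, clipping value, and toy data are illustrative assumptions only; practical implementations use deeper networks, minibatches, and automatic differentiation.

```python
import numpy as np

def neural_distance(x, y, width=64, steps=400, lr=0.05, clip=0.1, seed=0):
    """Estimate d_{F_nn}(mu_hat, nu_hat) = max_theta { E_x[f_theta] - E_y[f_theta] }
    for f_theta(x) = a . relu(W x + b), with all parameters clipped to
    [-clip, clip] so that the parameter set Theta is bounded."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    W = rng.uniform(-clip, clip, size=(width, d))
    b = np.zeros(width)
    a = rng.uniform(-clip, clip, size=width)
    for _ in range(steps):
        hx = np.maximum(x @ W.T + b, 0.0)            # (m, width) hidden activations
        hy = np.maximum(y @ W.T + b, 0.0)
        # gradients of E_x[f_theta] - E_y[f_theta] w.r.t. a, W, b
        ga = hx.mean(axis=0) - hy.mean(axis=0)
        mx = (hx > 0) * a                            # ReLU mask times output weights
        my = (hy > 0) * a
        gW = mx.T @ x / len(x) - my.T @ y / len(y)
        gb = mx.mean(axis=0) - my.mean(axis=0)
        a = np.clip(a + lr * ga, -clip, clip)        # projected gradient ascent
        W = np.clip(W + lr * gW, -clip, clip)
        b = np.clip(b + lr * gb, -clip, clip)
    hx = np.maximum(x @ W.T + b, 0.0)
    hy = np.maximum(y @ W.T + b, 0.0)
    return (hx @ a).mean() - (hy @ a).mean()

rng = np.random.default_rng(1)
x = rng.normal(size=(2000, 2))            # "true" samples
y = rng.normal(size=(2000, 2)) * 1.3      # "generated" samples with the wrong scale
x2 = rng.normal(size=(2000, 2))           # fresh samples from the true distribution
print(neural_distance(x, y))              # clearly positive
print(neural_distance(x, x2))             # small: mostly sampling error
```

The clipping step is what keeps Θ bounded; as Section 3 shows, this boundedness, through the Rademacher complexity of F_nn, is also what controls the gap between such an empirical estimate and the population neural distance.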
Because the parameter function set Fnn is much smaller than the non-parametric sets like Lip1(X), a key question is whether Fnn is large enough so that moment matching on Fnn (i.e., d Fnn(µ, ν) = 0) implies µ = ν. It turns out the answer is affirmative once Fnn is large enough so that its linear span (instead of Fnn itself) forms a universal approximator. This is a rather weak condition, which is satisfied even by very small sets such as neural networks with a single neuron. We make this concrete in the following. Definition 2.1. Let (X, d X) be a metric space and F be a set of functions on X. We say that d F(µ, ν) (and F) is discriminative if d F(µ, ν) = 0 µ = ν, for any two Borel probability measures µ, ν PB(X). In other words, F is discriminative if the moment matching on F, i.e., Eµ[f] = Eν[f] for any f F, implies µ = ν. The key observation is that Eµ[f] = Eν[f] for any f F implies the same holds true for all f in the linear span of F. Therefore, it is sufficient to require the linear span of F, instead of F itself, to be large enough to well approximate all the indicator test functions. Published as a conference paper at ICLR 2018 Theorem 2.2. For a given function set F Cb(X), define span F := {α0 + i=1 αifi : αi R, fi F, n N}. (3) Then d F(µ, ν) is discriminative if span F is dense in the space of bounded continuous functions Cb(X) under the uniform norm || || , that is, for any f Cb(X) and ϵ > 0, there exists an fϵ span F such that f fϵ ϵ. An equivalent way to put is that Cb(X) is included in the closure of span F, that is, cl(span F) Cb(X). (4) Further, (4) is a necessary condition for d F(µ, ν) to be discriminative if X is a compact space. Remark 2.1. The basic idea of characterizing probability measures using functions in Cb(X) is closely related to the concept of weak convergence. Recall that a sequence νn weakly converges to µ, i.e., νn µ, if and only if Eνn[f] Eµ[f] for all f Cb(X). Proof of the sufficient part of Theorem 2.2 is standard and the same with proof of the uniqueness of weak convergence; see, e.g., Lemma 9.3.2 in Dudley (2002). Remark 2.2. We obtain similar results for neural f-divergence dφ,F(µ||ν) in Theorem B.1. The difficulty in analyzing neural f-divergence is that moment matching on the discriminator set is only a sufficient condition for minimizing neural f-divergence, i.e., {ν : Eµ[f] = Eν[f], f F} ν : dφ,F(µ||ν) = arg min ν dφ,F(µ||ν) . Consequently, cl(span F) Cb(X) is only necessary but not sufficient for a neural f-divergence to be discriminative. When the discriminators are neural networks, we show that moment matching on the function set that consists of discriminators without their output activation function, denoted as F0, is a necessary condition for minimizing neural f-divergence, i.e., ν : dφ,F(µ||ν) = arg min ν dφ,F(µ||ν) {ν : Eµ[f0] = Eν[f0], f0 F0} . We refer to Theorem B.1 (ii) for a precise statement. Therefore, a neural f-divergence is discriminative if linear span of its discriminators without the output activation function is dense in the bounded continuous function space. Remark 2.3. Because the set of bounded Lipschitz functions BL(X) = {f Cb(X): ||f||Lip < } is dense in Cb(X), the condition in (4) can be replaced by a weaker condition cl(span F) BL(X). One can define a norm BL for functions in BL(X) by f BL = max{ f Lip, f }. This defines the bounded Lipschitz (BL) distance, d BL(µ, ν) = max f BL(X){Eµ f Eν f : ||f||BL 1}. 
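Written out in display form, the BL norm and the BL distance just defined are (this is only a restatement of the definitions above):

```latex
\|f\|_{BL} \;=\; \max\bigl\{\, \|f\|_{\mathrm{Lip}},\; \|f\|_{\infty} \,\bigr\},
\qquad
d_{BL}(\mu,\nu) \;=\; \sup_{f \in BL(X),\; \|f\|_{BL}\le 1}
\bigl\{\, \mathbb{E}_{\mu}[f] - \mathbb{E}_{\nu}[f] \,\bigr\}.
```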
The BL distance is known to metrize weak convergence in sense that d BL(µ, νn) 0 is equivalent to νn µ for all Borel probability measures on Rd; see section 8.3 in Bogachev (2007). Neural distances are discriminative. The key message of Theorem 2.2 is that it is sufficient to require cl(span F) Cb(X) (Condition (4)), which is a much weaker condition than the perhaps more straightforward condition cl(F) Cb(X). In fact, (4) is met by function sets that are much smaller than what we actually use in practice. For example, it is satisfied by the neural networks with only a single neuron, i.e., Fnn = {σ(w x + b): w Rd, b R}. (5) This is because its span span Fnn includes neural networks with infinite numbers of neurons, which are well known to be universal approximators in Cb(X) according to classical theories (e.g., Cybenko, 1989; Hornik et al., 1989; Hornik, 1991; Leshno et al., 1993; Barron, 1993). We recall the following classical result. Theorem 2.3 (Theorem 1 in Leshno et al. (1993)). Let σ : R R be a continuous activation function and X Rd be any compact set. Let Fnn be the set of neural networks with a single neuron as defined in (5), then span Fnn is dense in C(X) if and only if σ is not a polynomial. Published as a conference paper at ICLR 2018 The above result requires that the parameters [w, b] take values in Rd+1. In practice, however, we can only efficiently search in bounded parameter sets of [w, b] using local search methods like gradient descent. We observe that it is sufficient to replace Rd+1 with a bounded parameter set Θ for non-decreasing homogeneous activation functions such as σ(u) = max{u, 0}α with α N; note that α = 1 is the widely used rectified linear unit (Re LU). Corollary 2.4. Let X Rd be any compact set, and σ(u) = max{u, 0}α (α N), and Fnn = {σ(w x + b): [w, b] Θ}. Then span Fnn is dense in Cb(X) if {λθ : λ 0, θ Θ} = Rd+1. For the case when Θ = {θ Rd+1 : θ 2 1}, Bach (2017) not only proves that span Fnn is dense in Lip1(X) (and thus dense in Cb(X)), but also gives the convergence rate. Therefore, for Re LU activation functions, Fnn with bounded parameter sets, like {θ : θ 1} or {θ : θ = 1} for any norm on Rd+1, is sufficient to discriminate any two Borel probability measures. Note that this is not true for some other activation functions such as tanh or sigmoid, because there is an approximation gap between span{σ(w x + b) : [w, b] Θ Rd+1} and Cb(X) when Θ Rd+1 is bounded; see e.g., Barron (1993) (Theorem 3). From this perspective, homogeneous activation functions such as Re LU are preferred as discriminators. One advantage of using bounded parameter set Θ is that it makes Fnn have a bounded Lipschitz norm, and hence the corresponding neural distance is upper bounded by Wasserstein distance. In fact, W-GAN uses weight clipping to explicitly enforce θ δ. However, we should point out that the Lipschitz constraint does not help in making F discriminative since the constraint decreases, instead of enlarges, the function set F. Instead, the role of the Lipschitz constraint should be mostly in stabilizing the training (Arjovsky et al., 2017) and assuring a generalization bound as we discuss in Section 3. Another related way to justify the Lipschitz constraint is its relation to metrizing weak convergence, as we discuss in the sequel. Neural distance and weak convergence. If F is discriminative, then d F(µ, ν) = 0 implies µ = ν. In practice, however, we often cannot achieve d F(µ, ν) = 0 strictly. 
Instead, we often have d F(µ, νn) 0 for a sequence of νn and want to establish the weak convergence νn µ. Theorem 2.5. Let (X, d X) be any metric space. If span F is dense in Cb(X), we have limn d F(µ, νn) = 0 implies νn weakly converges to µ. Additionally, if F is contained in a bounded Lipchitz function space, i.e., there exists 0 < C < such that ||f||BL C for all f F, then νn weakly converges to µ implies limn d F(µ, νn) = 0. Theorem 10 of Liu et al. (2017) states a similar result for generic adversarial divergences, but does not obtain the specific weak convergence result for neural distances due to lacking of Theorem 2.2. Another difference is that Theorem 10 of Liu et al. (2017) heavily relies on the compactness assumption of X, while our result does not need this assumption. We provide the proof for Theorem 2.5 in Appendix C. When X is compact, Wasserstein distance and the BL distance are equivalent and both metrize weak convergence. As we discussed earlier, the condition cl(span F) = Cb(X) and F Lip K(X) are satisfied by neural networks Fnn with Re LU activation function and bounded parameter set Θ. Therefore, the related neural distance d Fnn is topologically equivalent to the Wasserstein and BL distance, because all of them metrize the weak convergence. This does not imply, however, that they are equivalent in the metric sense (or strongly equivalent) since the ratio d BL(µ, ν)/d Fnn(µ, ν) can be unbounded. In general, the neural distances are weaker than the BL distance because of smaller F. In Section 2.1 (and particularly Corollary 2.8), we draw more discussions on the bounds between BL distance and neural distances. 2.1 DISCRIMINATIVE POWER OF NEURAL DISTANCES Theorem 2.2 characterizes the condition under which a neural distance is discriminative, and shows that even neural networks with a single neuron are sufficient to be discriminative. This does not explain, however, why it is beneficial to use larger and deeper networks as we do in practice. What is missing here is to frame and understand how discriminative or strong a neural distance is. This is Published as a conference paper at ICLR 2018 because even if d F(µ, ν) is discriminative, it can be relatively weak in that d F(µ, ν) may be small when µ and ν are very different under standard metrics (e.g., BL distance). Obviously, a larger F yields a stronger neural distance, that is, if F F , then d F(µ, ν) d F (µ, ν). For example, because it is reasonable to assume that neural networks are bounded Lipschitz when X and Θ are bounded, we can control a neural distance with the BL distance: d F(µ, ν) Cd BL(µ, ν), where C := supf F{||f||BL} < . A more difficult question is if we can establish inequalities in the other direction, that is, controlling d BL(µ, ν), or in general a stronger d F (µ, ν), with a weaker d F(µ, ν) in some way. In this section, we characterize conditions under which this is possible and develop bounds that allow us to use neural distances to control stronger distances such as BL distance, and even KL divergence. These bounds are used in Section 3 to translate generalization bounds in d F(µ, ν) to that in BL distance and KL divergence. The core of the discussion involves understanding how d F(µ, ν) can be used to control the difference of the moment | Eµ g Eν g| for g outside of F. We address this problem by two steps: first controlling functions in span F, and then functions in cl(span F) that is large enough to include Cb(X) for neural networks. Controlling functions in span F. 
We start with understanding how d F(µ, ν) can bound | Eµ g Eν g| for g span F. This can be characterized by introducing a notion of norm on span F. Proposition 2.6. For each g span F that can be decomposed into g = Pn i=1 wifi + w0 as we define in (3), the F-variation norm ||g||F,1 of g is the infimum of Pn i=1 |wi| among all possible decompositions of g, that is, ||g||F,1 = inf n X i=1 |wi|: g = i=1 wifi + w0, n N, w0, wi R, fi F . Then we have | Eµ g Eν g| ||g||F,1 d F(µ, ν), g span F. Intuitively speaking, ||g||F,1 denotes the minimum number of functions in F needed to represent g. As F becomes larger, ||g||F,1 decreases and d F(µ, ν) can better control | Eµ g Eν g|. Precisely, if F F then ||g||F ,1 ||g||F,1. Therefore, although adding more neurons in F may not necessarily enlarge span F, it decreases ||g||F,1 and yields a stronger neural distance. Controlling functions in cl(span F). A more critical question is how the neural distance d F(µ, ν) can also control the discrepancy Eµ g Eν g for functions outside of span F but inside cl(span F). The bound in this case is characterized by a notion of error decay function defined as follows. Proposition 2.7. Given a function g, we say that g is approximated by F with error decay function ϵ(r) if for any r 0, there exists an fr span F with ||fr||F,1 r such that ||f fr|| ϵ(r). Therefore, g cl(span F) if and only if infr 0 ϵ(r) = 0. We have | Eµ g Eν g| inf r 0{2ϵ(r) + r d F(µ, ν)}. In particular, if ϵ(r) = O(r κ) for some κ > 0, then | Eµ g Eν g| = O(d F(µ, ν) κ κ+1 ). It requires further efforts to derive the error decay function for specific F and g. For example, Proposition 6 of Bach (2017) allows us to derive the decay rate of approximating bounded Lipschitz functions with rectified neurons, yielding a bound between BL distance and neural distance. Corollary 2.8. Let X be the unit ball of Rd under norm || ||q for some q [2, ), that is, X = {x Rd : ||x||q 1}. Consider F consisting of a single rectified neuron F = {max(v [x; 1], 0)α : v Rd+1, ||v||p = 1} where α N, 1 q = 1. Then we have d BL(µ, ν) = O(d F(µ, ν) 1 α+(d+1)/2 ), (6) where O denotes the big-O notation ignoring the logarithm factor. The result in (6) shows that d F(µ, ν) gives an increasingly weaker bound when the dimension d increases. This is expected because we approximate a non-parametric set with a parametric one. Published as a conference paper at ICLR 2018 Likelihood and KL divergence. Maximum likelihood has been the predominant approach in statistical learning, and testing likelihood forms a standard criterion for testing unsupervised models. The recent advances in deep unsupervised learning, however, make it questionable whether likelihood is the right objective for training and evaluation (e.g., Theis et al., 2015). For example, some recent empirical studies (e.g., Danihelka et al., 2017; Grover et al., 2017) showed a counter-intuitive phenomenon that both the testing and training likelihood (assuming generators with valid densities are used) tend to decrease, instead of increase, as the GAN loss is minimized. A hypothesis for explaining this is that the neural distances used in GANs are too weak to control the KL divergence properly. Therefore, from the theoretical perspective, it is desirable to understand under what conditions (even if it is a very strong one), the neural distance can be strong enough to control KL divergence. This can be done by the following simple result. Proposition 2.9. Assume µ and ν have positive density functions ρµ(x) and ρν(x), respectively. 
Then KL(µ||ν) + KL(ν||µ) = Eµ log(ρµ/ρν) Eν log(ρµ/ρν). If log(ρµ/ρν) span F, then KL(µ||ν) + KL(ν||µ) || log(ρµ/ρν)||F,1 d F(µ, ν). (7) If log(ρµ/ρν) cl(span F) with an error decay function ϵ(r) = O(r κ), then KL(µ||ν) + KL(ν||µ) = O(d F(µ, ν) κ κ+1 ). (8) This result shows that we require that the density ratio log(ρµ/ρν) should exist and behave nicely in span F or cl(span F) in order to bound KL divergence with d F(µ, ν). If either µ or ν is an empirical measure, the bound is vacuum since KL(µ, ν) + KL(µ, ν) equals infinite, while d F(µ, ν) remains finite once F is bounded, i.e., ||f|| < for all f F. Obviously, this strong condition is hard to satisfy in practice, because practical data distributions and generators in GANs often have no densities or at least highly peaky densities. We draw more discussions in Corollary 3.5. 3 GENERALIZATION PROPERTY OF GANS Section 2 suggests that it is better to use larger discriminator set F in order to obtain stronger neural distance. However, why do regularization techniques, which effectively shrink the discriminator set, help GAN training in practice? The answer has to do with the fact that we observe the true model µ only through an i.i.d. sample of size m (whose empirical measure is denoted by ˆµm), and hence can only optimize the empirical loss d F(ˆµm, ν), instead of the exact loss d F(µ, ν). Therefore, generalization bounds are required to control the exact loss d F(µ, ν) when we can only minimize its empirical version d F(ˆµm, ν). Specifically, let G be a class of generators that may or may not include the unknown true distribution µ. Assume νm minimizes the GAN loss d F(ˆµm, ν) up to an ϵ (ϵ 0) accuracy, that is, d F(ˆµm, νm) inf ν G d F(ˆµm, ν) + ϵ. (9) We are interested in bounding the difference between νm and the unknown µ under certain evaluation metric. Depending on what we care about, we may be interested in the generalization error in terms of the neural distance d F(µ, νm), or other standard quantities of interest such as BL distance d BL(µ, νm) and KL divergence KL(µ, νm) or the testing likelihood. In this section, we adapt the standard Rademacher complexity argument to establish generalization bounds for GANs. We show that the discriminator set F should be small enough to be generalizable, striking a tradeoff with the other requirement that it should be large enough to be discriminative. We first present the generalization bound under neural distance, which purely depends on the Rademacher complexity of the discriminator set F and is independent of the generator set G. Then using the results in Section (2.1), we discuss the generalization bounds under other standard metrics, like BL distance and KL divergence. Published as a conference paper at ICLR 2018 3.1 GENERALIZATION UNDER NEURAL DISTANCE Using the standard derivation and the optimality condition (9), we have (see Appendix D) d F(µ, νm) inf ν G d F(µ, ν) 2 sup f F |Eµ[f] Eˆµm[f]| + ϵ = 2d F(µ, ˆµm) + ϵ. (10) This reduces the problem to bounding the discrepancy d F(µ, ˆµm) := supf F |Eµ[f] Eˆµm[f]| between the true model µ and its empirical version ˆµm. This can be achieved by the uniform concentration bounds developed in statistical learning theory (e.g., Vapnik & Vapnik, 1998) and empirical process (e.g., Van de Geer, 2000). In particular, the concentration property related to supf F |Eµ[f] Eˆµm[f]| can be characterized by the Rademacher complexity of F (w.r.t. measure µ), defined as R(µ) m (F) := E where the expectation is taken w.r.t. 
Xi µ, and Rademacher random variable τi: prob(τi = 1) = prob(τi = 1) = 1/2. Intuitively, R(µ) m (F) characterizes the ability of overfitting with pure random labels using functions in F and hence relates to the generalization bounds. Standard results in learning theory show that sup f F |Eµ[f] Eˆµm[f]| R(µ) m (F) + O( where = supf F ||f|| . Combining this with (10), we obtain the following result. Theorem 3.1. Assume that the discriminator set F is even, i.e., f F implies f F, and that all discriminators are bounded by , i.e., f for any f F. Let ˆµm be an empirical measure of an i.i.d. sample of size m drawn from µ. Assume νm G satisfies d F(ˆµm, νm) infν G d F(ˆµm, ν) + ϵ. Then with probability at least 1 δ, we have d F(µ, νm) inf ν G d F(µ, ν) 2R(µ) m (F) + 2 m + ϵ, (12) where R(µ) m (F) is the Rademacher complexity of F defined in (11). We obtain nearly the same generalization bound for neural f-divergence in Theorem B.3. Theorem 3.1 relates the generalization error of GANs to the Rademacher complexity of the discriminator set F. The smaller the discriminator set F is, the more generalizable the result is. Therefore, the choice of F should strike a subtle balance between the generalizability and the discriminative power: F should be large enough to make d F(µ, ν) discriminative as we discuss in Section 2.1, and simultaneously should be small enough to have a small generalization error in (12). It turns out parametric neural discriminators strike a good balance for this purpose, given that it is both discriminative as we show in Section 2.1, and give small generalization bound as we show in the following. Corollary 3.2. Let X be the unit ball of Rd under norm || ||2, that is, X = {x Rd : ||x||2 1}. Assume that F is neural networks with a single rectified linear unit (Re LU) F = {max(v [x; 1], 0): v Rd+1, ||v||2 = 1}. Then with probability at least 1 δ, d F(µ, νm) inf ν G d F(µ, ν) + C m + ϵ (13) d BL(µ, νm) = O inf ν G d F(µ, ν) + C m + ϵ 1 (d+3)/2 , (14) where C = 4 log(1/δ) and O denotes the big-O notation ignoring the logarithm factor. Published as a conference paper at ICLR 2018 Note that the three terms in Eqn. (13) take into account the modeling error (infv G d F(µ, ν)), sample complexity and generalization error (C/ m), and optimization error (ϵ), respectively. Assuming zero modeling error and optimization error, we have (1) d F(µ, νm) = O(m 1/2), which achieves the typical parametric convergence rate; (2) d BL(µ, νm) = O(m 1 d+3 ), which becomes slower as the dimension d increases. This decrease is because of the non-parametric nature of BL distance, instead of learning algorithm. As we show in Appendix A, we obtain a similar rate of d BL(µ, νm) = O(m 1 d ), even if we directly use BL distance as the learning objective. Similar results can be obtained for general parametric discriminator class as follows. Corollary 3.3. Under the condition of Theorem 3.1, we further assume that (1) F = {fθ : θ Θ [ 1, 1]p} is a parametric function class with p parameters in a bounded set Θ and that (2) every fθ is L-Lipschitz continuous with respect to the parameters θ, i.e., fθ fθ L θ θ 2. Then with probability at least 1 δ, we have d F(µ, νm) inf ν G d F(µ, ν) + C m + ϵ, (15) where C = 16 2πp L + 2 p 2 log(1/δ). This result can be easily applied to neural discriminators, since neural networks fθ(x) are generally Lipschitz w.r.t. the parameter θ, once the input domain X is bounded. 
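The role of the Rademacher complexity in Theorem 3.1 and Corollaries 3.2-3.3 can be checked numerically. The sketch below gives a Monte Carlo estimate of the empirical Rademacher complexity of a small discriminator class (single ReLU neurons with unit-norm parameters, approximated by a finite grid) and exhibits the roughly O(1/√m) decay that drives the generalization term. The grid size, data distribution, and function names are illustrative assumptions, not part of the paper's results.

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_estimate(x, thetas, n_draws=200):
    """Monte Carlo estimate of the empirical Rademacher complexity
    hat{R}_m(F) = E_tau[ sup_{f in F} (1/m) sum_i tau_i f(x_i) ]
    for F = {x -> relu(v . [x; 1]) : v in a finite grid} together with -F."""
    m = x.shape[0]
    feats = np.maximum(np.c_[x, np.ones(m)] @ thetas.T, 0.0)   # (m, n_thetas)
    sups = []
    for _ in range(n_draws):
        tau = rng.choice([-1.0, 1.0], size=m)                  # Rademacher signs
        sups.append(np.abs(tau @ feats).max() / m)             # sup over F and -F
    return float(np.mean(sups))

d = 2
thetas = rng.normal(size=(512, d + 1))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)        # enforce ||v||_2 = 1

for m in (100, 400, 1600, 6400):
    x = rng.normal(size=(m, d))
    x /= np.maximum(1.0, np.linalg.norm(x, axis=1, keepdims=True))  # project into unit ball
    print(m, rademacher_estimate(x, thetas))   # shrinks roughly like 1/sqrt(m)
```

Plugging such an estimate, together with the O(√(log(1/δ)/m)) deviation term, into (12) gives a concrete value for the generalization term in the neural distance.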
For neural discriminators, we also apply the bound on the Rademacher complexity of DNNs recently derived in Bartlett et al. (2017), which gives a sharper bound than that in Corollary 3.3; see Appendix A.1. With the basic result in Theorem 3.1, we can also discuss the learning bounds of GANs with choices of non-parametric discriminators. Making use of Rademacher complexity of bounded sets in a RKHS (e.g., Lemma 22 in Bartlett & Mendelson (2003)), we give the learning bound of MMDbased GANs (Li et al., 2015; Dziugaite et al., 2015) as follows. We present the results for Wasserstein distance and total variance distance in Appendix A.2, and highlight the advantages of using parametric neural discriminators. Corollary 3.4. Under the condition of Theorem 3.1, we further assume that F = {f H: f H 1} where H is a RKHS whose positive definite kernel k(x, x ) satisfies k(x, x) Ck < + for all x X. Then with probability at least 1 δ, d F(µ, νm) inf ν G d F(µ, ν) + C m + ϵ, (16) where C = 2 2 + p 2 log(1/δ) Ck. Remark 3.1 (Comparisons with results in Arora et al. (2017)). Arora et al. (2017) also discussed the generalization properties of GANs under a similar framework. In particular, they developed bounds of form |d F(µ, ν) d F(ˆµm, ˆνm)| where ˆµm and ˆνm are empirical versions of the target distribution µ and ν with sample size m. Our framework is similar, but considers bounding the quantity d F(µ, νm) infν G d F(µ, ν), which is of more direct interest. In fact, our Eqn. (10) shows that our generalization error can be bounded by the generalization error studied in Arora et al. (2017). Another difference is that we adapt the Rademacher complexity argument to derive the bound, while Arora et al. (2017) made use of the ϵ-net argument. Bounding the KL divergence and testing likelihood. The above results depend on the evaluation metric we use, which is d F(µ, ν) or d BL(µ, ν). If we are interested in evaluating the model using even stronger metrics, such as KL divergence or equivalently testing likelihood, then the generator set G enters the scene in a more subtle way, in that a larger generator set G should be companioned with a larger discriminator set F in order to provide meaningful bounds on KL divergence. This is illustrated in the following result obtained by combining Theorem 3.1 and Proposition 2.9. Corollary 3.5. Assume both the true µ and all the generators ν G have positive densities ρµ and ρν, respectively. Assume F consists of bounded functions with := supf F ||f|| < . Further, assume the discriminator set F is compatible with the generator set G in the sense that log(ρν/ρµ) span F, ν G, with a compatible coefficient defined as ΛF,G := sup ν G || log(ρν/ρµ)||F,1 < . Published as a conference paper at ICLR 2018 KL(µ, νm) ΛF,G 2R(µ) m (F) + 2 p 2 log(1/δ)/m + inf ν G KL(µ, ν) + ϵ . (17) Different from the earlier bounds, the bound in (17) depends on the compatibility coefficient ΛF,G that casts a more interesting trade-off on the choice of the generator set G: the generator set G should be small and have well-behaved density functions to ensure a small ΛF,G, while should be large enough to have a small modeling error infν G p KL(µ, ν). Related, the discriminator set should be large enough to include all density ratios log(ρµ/ρν) in a ball of radius ΛF,G of span F, and should also be small to have a low Rademacher complexity R(µ) m (F). 
Obviously, one can also extend Corollary 3.5 using (8) in Proposition 2.7, to allow log(ρµ/ρν) cl(span F) in which case the compatibility of G and F should be mainly characterized by the error decay function ϵ(r). KL(µ, νm) = Eµ[log pµ] Eµ[log pνm] is the difference between the testing likelihood Eµ[log pνm] of estimated model νm and the optimal testing likelihood Eµ[log pµ]. Therefore, Corollary 3.5 also provides a bound for testing likelihood. Unfortunately, the condition in Corollary 3.5 is rather strong, in that it requires that both the true distribution µ and the generators ν have positive densities and that the log-density ratio log(ρµ/ρν) be well-behaved. In practical applications of computer vision, however, both µ and ν tend to concentrate on local regions or sub-manifolds of X, with very peaky densities, or even no valid densities; this causes the compatibility coefficient ΛF,G very large, or infinite, making the bound in (17) loose or vacuum. This provides a potential explanation for some of the recent empirical findings (e.g., Danihelka et al., 2017; Grover et al., 2017) that the negative testing likelihood is uncorrelated with the GAN loss functions, or even increases during the GAN training progress. The underlying reason here is that the neural distance is not strong enough to provide meaningful bound for KL divergence. See Appendix E for an illustration using toy examples. 4 RELATED WORK There is a surge of research interest in GANs; however, most of the work has been empirical in nature. There has been some theoretical literature on understanding GANs, including the discrimination and generalization properties of GANs. The discriminative power of GANs is typically justified by assuming that the discriminator set F has enough capacity. For example, Goodfellow et al. (2014) assumes that F contains the optimal discriminator pdata(x) pdata(x)+pg(x) . Similar capacity assumptions have been made in nearly all other GANs to prove their discriminative power; see, e.g., Zhao et al. (2016); Nowozin et al. (2016); Arjovsky et al. (2017). However, discriminators are in practice taken as certain parametric function class, like neural networks, which violates these capacity assumptions. The universal approximation property of neural networks is used to justify the discriminative power empirically. In this work, we show that the GAN loss is discriminative if span F can approximate any continuous functions. This condition is very weak and can be satisfied even when none of the discriminators is close to the optimal discriminator. The MMD-based GANs (Li et al., 2015; Dziugaite et al., 2015; Li et al., 2017a) avoid the parametrization of discriminators by taking advantage of the close-form solution of the optimal discriminator in the non-parametric RKHS space. Therefore, the capacity assumption is satisfied in MMD-based GANs, and their discriminative power is easily justified. Liu et al. (2017) defines a notion of adversarial divergences that include a number of GAN objective functions. They show that if the objective function is an adversarial divergence with some additional conditions, then using a restricted discriminator family has a moment-matching effect. Our treatment of the neural divergence is directly inspired by them. We refer to Remark B.1 for a detailed comparison. Liu et al. (2017) also shows that for objective functions that are strict adversarial divergence, convergence in the objective function implies weak convergence. 
However, they do not provide a condition under which an adversarial divergence is strict. A major contribution of our work is to fill this gap, and to provide such a condition that is sufficient and necessary. Dziugaite et al. (2015) studies generalization error, defined as d F(µ, νm) infν G d F(µ, ν) in our notation, for MMD-GAN in terms of fat-shattering dimension. Moreover, Dziugaite et al. (2015) obtains a generalization bound that incorporates the complexity of the hypothesis set G. Although Published as a conference paper at ICLR 2018 their overall error bound is still O(m 1/2), their work shows the possibility to sharpen our Gindependent bound. Arora et al. (2017) studies the generalization properties of GANs through the quantity d F(µ, ν) d F(ˆµm, ˆνm) (in our notations). The main difference between our work and Arora et al. (2017) is the definition of generalization error; see more discussions in Remark 3.1. Moreover, Arora et al. (2017) allows only polynomial number of samples from the generated distribution because the training algorithm should run in polynomial time. We do not consider this issue because in this work we only study the statistical properties of the objective functions and do not touch the optimization method. Finally, Arora et al. (2017) shows that the GAN loss can approach its optimal value even if the generated distribution has very low support, and (Arora & Zhang, 2017) provides empirical evidence for this problem. Our result is consistent with their results because our generalization error is measured by the neural distance/divergence. Finally, there are some other lines of research on understanding GANs. Li et al. (2017b) studies the dynamics of GAN s training and finds that: a GAN with an optimal discriminator provably converges, while a first order approximation of the discriminator leads to unstable dynamics and mode collapse. Lei et al. (2017) studies WGAN and optimal transportation by convex geometry and provides a close-form formula for the optimal transportation map. Hu et al. (2017) provides a new formulation of GANs and variational autoencoders (VAEs), and thus unifies the most two popular methods to train deep generative models. We d like to mention other recent interesting research on GANs, e.g., (Guo et al., 2017; Sinn & Rawat, 2017; Nock et al., 2017; Mescheder et al., 2017; Tolstikhin et al., 2017; Heusel et al., 2017). 5 CONCLUSIONS We studied the discrimination and generalization properties of GANs with parameterized discriminator class such as neural networks. A neural distance is guaranteed to be discriminative whenever the linear span of its discriminator set is dense in the bounded continuous function space. On the other hand, a neural divergence is discriminative whenever the linear span of features defined by the last linear layer of its discriminators is dense in the bounded continuous function space. We also provided generalization bounds for GANs in different evaluation metrics. In terms of neural distance, our bounds show that generalization is guaranteed as long as the discriminator set is small enough, regardless of the size of the generator or hypothesis set. This raises an interesting discriminationgeneralization balance in GANs. Fortunately, several GAN methods in practice already choose their discriminator set at the sweet point, where both the discrimination and generalization hold. 
Finally, our generalization bound in KL divergence provides an explanation on the counter-intuitive behaviors of testing likelihood in GAN training. There are several directions that we would like to explore in the future. First of all, in this paper, we do not talk about methods to compute the neural distance/divergence. This is typically a non-concave maximization problem and is extremely difficult to solve. Many methods have been proposed to solve this kind of minimax problems, but both stable training methods and theoretical analysis of these algorithms are still missing. Secondly, our generalization bound depends purely on the discriminator set. It is possible to obtain sharper bounds by incorporating structural information from the generator set. Finally, we would like to extend our analysis to conditional GANs (see, e.g., Mirza & Osindero (2014); Springenberg (2015); Chen et al. (2016); Odena et al. (2016)), which have demonstrated impressive performance (Reed et al., 2016a;b; Zhang et al., 2017). Anonymous. Spectral normalization for generative adversarial networks. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id= B1QRgzi T-. Martin Arjovsky, Soumith Chintala, and L eon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214 223, 2017. Sanjeev Arora and Yi Zhang. Do gans actually learn the distribution? an empirical study. ar Xiv preprint ar Xiv:1706.08224, 2017. Published as a conference paper at ICLR 2018 Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (gans). ar Xiv preprint ar Xiv:1703.00573, 2017. Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1 53, 2017. Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930 945, 1993. Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463 482, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944944. Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6241 6250, 2017. Vladimir I Bogachev. Measure theory, volume 1. Springer Science & Business Media, 2007. Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172 2180, 2016. George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303 314, 1989. Ivo Danihelka, Balaji Lakshminarayanan, Benigno Uria, Daan Wierstra, and Peter Dayan. Comparison of maximum likelihood and gan-based training of real nvps. ar Xiv preprint ar Xiv:1705.05263, 2017. R. M. Dudley. Real Analysis and Probability. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2 edition, 2002. doi: 10.1017/CBO9780511755347. Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI 15, pp. 
258 267, Arlington, Virginia, United States, 2015. AUAI Press. ISBN 978-0-9966431-0-8. URL http://dl.acm.org/ citation.cfm?id=3020847.3020875. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672 2680, 2014. Aditya Grover, Manik Dhar, and Stefano Ermon. Flow-gan: Bridging implicit and prescribed learning in generative models. ar Xiv preprint ar Xiv:1705.08868, 2017. Jianbo Guo, Guangxiang Zhu, and Jian Li. Generative adversarial mapping networks. ar Xiv preprint ar Xiv:1709.09820, 2017. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, G unter Klambauer, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a nash equilibrium. ar Xiv preprint ar Xiv:1706.08500, 2017. Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4 (2):251 257, 1991. Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359 366, 1989. Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P Xing. On unifying deep generative models. ar Xiv preprint ar Xiv:1706.00550, 2017. Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, pp. 793 800, 2009. Published as a conference paper at ICLR 2018 Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013. Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, and David Xianfeng Gu. A geometric view of optimal transportation and generative model. ar Xiv preprint ar Xiv:1710.05488, 2017. Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861 867, 1993. Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnab as P oczos. Mmd gan: Towards deeper understanding of moment matching network. ar Xiv preprint ar Xiv:1705.08584, 2017a. Jerry Li, Aleksander Madry, John Peebles, and Ludwig Schmidt. Towards understanding the dynamics of generative adversarial networks. ar Xiv preprint ar Xiv:1706.09884, 2017b. Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1718 1727, 2015. Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. ar Xiv preprint ar Xiv:1705.08991, 2017. Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. ar Xiv preprint ar Xiv:1705.10461, 2017. Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. ar Xiv preprint ar Xiv:1411.1784, 2014. Alfred M uller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429 443, 1997. ISSN 00018678. URL http://www.jstor. org/stable/1428011. Richard Nock, Zac Cranko, Aditya Krishna Menon, Lizhen Qu, and Robert C Williamson. f-gans in an information geometric nutshell. ar Xiv preprint ar Xiv:1707.04385, 2017. Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. 
In Advances in Neural Information Processing Systems, pp. 271 279, 2016. Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. ar Xiv preprint ar Xiv:1610.09585, 2016. Scott Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In NIPS, 2016a. Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text-to-image synthesis. In ICML, 2016b. Mathieu Sinn and Ambrish Rawat. Towards consistency of adversarial training for generative models. ar Xiv preprint ar Xiv:1705.09199, 2017. Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. ar Xiv preprint ar Xiv:1511.06390, 2015. Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Sch olkopf, and Gert RG Lanckriet. On integral probability metrics,\phi-divergences and binary classification. ar Xiv preprint ar Xiv:0901.2698, 2009. Lucas Theis, A aron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. ar Xiv preprint ar Xiv:1511.01844, 2015. Ilya Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Sch olkopf. Adagan: Boosting generative models. ar Xiv preprint ar Xiv:1701.02386, 2017. Published as a conference paper at ICLR 2018 Sara A Van de Geer. Applications of empirical process theory, volume 91. Cambridge University Press Cambridge, 2000. Vladimir Naumovich Vapnik and Vlamimir Vapnik. Statistical learning theory, volume 1. Wiley New York, 1998. Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. ICCV, 2017. Junbo Zhao, Michael Mathieu, and Yann Le Cun. Energy-based generative adversarial network. ar Xiv preprint ar Xiv:1609.03126, 2016. Published as a conference paper at ICLR 2018 A GENERALIZATION ERROR OF OTHER DISCRIMINATOR SETS F A.1 GENERALIZATION BOUND FOR NEURAL DISCRIMINATORS For neural discriminators, we can use the following bound on the Rademacher complexity of DNNs, which was recently proposed in Bartlett et al. (2017). Theorem A.1. Let fixed activation functions (σ1, . . . , σL) and reference matrices (M1, . . . , ML) be given, where σi is ρi-Lipschitz and σi(0) = 0. Let spectral norm bounds (s1, . . . , s L) and matrix (2,1) norm bounds (b1, . . . , b L) be given. Let F denote the discriminator set consisting of all choices of neural network f A: Fnn := {f A : A =: (A1, . . . , AL), Ai σ si, AT i M T i 2,1 bi}, (18) where A σ := σmax(A) and A 2,1 := ( A:,1 2, . . . , A:,m 2) 1 are the matrix spectral norm and (2, 1) norm, respectively, and f A(x) := σL(ALσL 1(AL 1 . . . σ1(A1x) . . . )) is the neural network associated with weight matrices (A1, . . . , AL). Moreover, assume that each matrix in (A1, . . . , AL) has dimension at most W along each axis and define the spectral normalized complexity R as log(2W 2) ΠL j=1sjρj L X Let data matrix X Rm d be given, where the m rows correspond to data points. When the sample size m 3 X F R, the empirical Rademacher complexity satisfies ˆRm(Fnn) := Eτ 1 + log m 3 X F R where τ = (τ1, . . . , τm) are the Rademacher random variables which are iid with Pr[τi = 1] = Pr[τi = 1] = 1/2, X F is the Frobenius norm of X. Proof. The proof is the same with the proof of Lemma A.8 in Bartlett et al. (2017). 
When m 3 X F R, we use the optimal α = 3 X F R/ m to obtain the above result. Combined with our Theorem 3.1, we obtain the following generalization bound for the neural discriminator set defined in (18). Corollary A.2. Suppose that the discriminator set Fnn is taken as (18) and that f for any f Fnn. Let data matrix X Rm d be the m data points that define the empirical distribution ˆµm. Then with probability at least 1 δ, we have d Fnn(µ, νm) inf ν G d F(µ, ν) + 48 X F R 1 + log m 3 X F R m + ϵ, (21) where R is the spectral normalized complexity defined in (19) and ϵ is the optimization error defined in (9). Proof. In the proof of Theorem 3.1, instead of using sup f F |Eµ[f] Eˆµm[f]| R(µ) m (F) + 2 sup f F |Eµ[f] Eˆµm[f]| ˆRm(F) + 6 2m to revise the generalization bound (12) as d F(µ, νm) inf ν G d F(µ, ν) 2 ˆRm(F) + 6 m + ϵ. (22) Combining the revised bound with Equation (20), we conclude the proof. Published as a conference paper at ICLR 2018 Compared to Corollary 3.3, the bound in (22) gets rid of the number of parameters p, which can be prohibitively large in practice. Moreover, Corollary A.2 can be directly applied to the spectral normalized GANs (Anonymous, 2018), and may give an explanation of the empirical success of the spectral normalization technique. A.2 GENERALIZATION BOUNDS FOR NON-PARAMETRIC DISCRIMINATOR SETS With the basic result in Theorem 3.1, we can also discuss the learning bounds of GANs with other choices of non-parametric discriminator sets F. This allows us to highlight the advantages of using parametric neural discriminators. For simplicity, we assume zero model error and optimization so that the bound is solely based on the generalization error d F(µ, ˆµm) between µ and its empirical version ˆµm. 1. Bounded Lipschitz distance, F = {f C(X) : ||f||BL 1}, which is equivalent to Wasserstein distance when X is compact. When X is a convex bounded set in Rd, we have R(µ) m (F) m 1/d for d > 2 (see Corollary 12 in Sriperumbudur et al. (2009)), and hence d BL(µ, ν) = O(m 1/d). This is comparable with Corollary 3.2. This bound is tight. Assume that µ is the uniform distribution on X. A simple derivation (similar to Lemma 1 in Arora et al. (2017)) shows that d F(µ, ˆµm) c(1 m exp( Ω(d))) for some constant only depending on X. Therefore, one must need at least m = exp(Ω(d)) samples to reduce d F(µ, ˆµm), and hence the generalization bound, to O(ϵ). 2. Total variation (TV) distance, F = {f C(X) : f min{1, }}. It is easy to verify that R(µ) m (F) = 2. Therefore, Eqn. (12) cannot guarantee generalization even when we have infinite number of samples, i.e., m . The estimate given in Eqn. (12) is tight. Assume that µ is the uniform distribution on X. It is easy to see that d TV(µ, ˆµm) = 2 almost surely. Therefore, νm is close to ˆµm implies that it is order 1 away from µ, which means that generalization does not hold in this case. With the statement that training with the TV distance does not generalize, we mean that training with TV distance does not generalize in TV distance. More precisely, even if the training loss on empirical samples is very small, i.e., TV(ˆµm, νm) = O(ϵ), the TV distance to the unknown target distribution can be large, i.e., d T V (µ, νm) = O(1). However, this does not imply that training with TV distance is useless, because it is possible that training with a stronger metric leads to asymptotic vanishing in a weaker metric. For example, d T V (ˆµm, νm) = O(ϵ) implies d Fnn(ˆµm, νm) = O(ϵ), and thus a small d Fnn(µ, νm). 
Take the Wasserstein metric as another example, even though we only establish d W (µ, νm) = O(m 1/d) (assuming zero model error (µ G) and optimization ϵ = 0), it does not eliminate the possibility that the weaker neural distance has a faster convergence rate d Fnn(µ, νm) = O(m 1/2). From the practical perspective, however, TV and Wasserstein distances are less clearly favorable than neural distance because the difficulty of calculating and optimizing them. B NEURAL φ-DIVERGENCE f-GAN is another broad family of GANs that are based on minimizing f-divergence (also called φ-divergence) (Nowozin et al., 2016), which includes the original GAN by Goodfellow et al. (2014). 1 However, φ-divergence has substantially different properties from IPM (see e.g., Sriperumbudur et al. (2009)), and is not defined as the intuitive moment matching form as IPM. In this Appendix, we extend our analysis to φ-divergence by interpreting it as a form of penalized moment matching. Similar to the case of IPM, we analyze the neural φ-divergence that restricts the discriminators to parametric function set F for practical computability, and establish its discrimination and generalization properties under mild conditions that practical f-GANs satisfy. Assume that µ and ν are two distributions on X. Given a convex, lower-semicontinuous univariate function φ that satisfies φ(1) = 0, the related φ-divergence is dφ(µ || ν) = Eν φ dµ dν . If φ is strictly convex, then a standard derivation based on Jensen s inequality shows that φ-divergence is 1In this appendix, we call it φ-divergence because f has been used for discriminators. Published as a conference paper at ICLR 2018 nonnegative and discriminative: dφ(µ || ν) φ(1) = 0 and the equality holds iff µ = ν. Different choices of φ recover popular divergences as special cases. For example, φ(t) = (t 1)2 recovers Pearson χ2 divergence, and φ(t) = (u + 1) log((u + 1)/2) + u log u gives the Jensen-Shannon divergence used in the vanilla GAN Goodfellow et al. (2014). B.1 DISCRIMINATIVE POWER OF NEURAL φ-DIVERGENCE In this work, we find it helps to develop intuition by introducing another convex function ψ(t) := φ(t + 1), defined by shifting the input variable of φ by +1; the φ-divergence becomes dφ(µ || ν) = Eν X ρν(x)ψ ρµ(x) ρν(x) 1 τ(dx), (23) where we should require that ψ(0) = 0; in right hand side of (23), we assume ρµ and ρν are the density functions of µ and ν, respectively, under a base measure τ. The key advantage of introducing ψ is that it gives a suggestive variational representation that can be viewed as a regularized moment matching. Specially, assume ψ is the convex conjugate of ψ, that is, ψ (t) = supy{yt ψ(y)}. By standard derivation, we can show that dφ(µ || ν) sup f A (Eµ[f] Eν[f] Ψν,ψ [f]) , with Ψν,ψ [f] := Ex ν[ψ (f(x))], (24) where A is the class of all functions f : X dom(ψ ) where dom(ψ ) = {t: ψ (t) R}, and the equality holds if ϕ ( ρµ(x) ρν(x) 1) A where ϕ is the inverse function of ψ . In (24), the term Ψν,ψ [f], as we show in Lemma B.1 in sequel, can be viewed as a type of complexity penalty on f that ensures the supreme is finite. This is in contrast with the IPM d F(µ, ν) in which the complexity constraint is directly imposed using the function class F, instead of a regularization term. Lemma B.1. Assume ψ: R R { } is a convex, lower-semicontinuous function with conjugate ψ and ψ(0) = 0. The penalty Ψν,ψ [f] in (24) has the following properties i) Ψν,ψ [f] is a convex functional of f, and Ψν,ψ [f] 0 for any f. 
Lemma B.1. Assume $\psi: \mathbb{R}\to\mathbb{R}\cup\{\infty\}$ is a convex, lower-semicontinuous function with conjugate $\psi^*$ and $\psi(0) = 0$. The penalty $\Psi_{\nu,\psi^*}[f]$ in (24) has the following properties:

i) $\Psi_{\nu,\psi^*}[f]$ is a convex functional of $f$, and $\Psi_{\nu,\psi^*}[f] \ge 0$ for any $f$.

ii) There exists a constant $b_0\in\mathbb{R}\cup\{\pm\infty\}$ such that $\psi^*(b_0) = 0$. Further, if $\psi$ is strictly convex, then $\Psi_{\nu,\psi^*}[f] = 0$ implies $f(x) = b_0$ almost surely under the measure $\nu$.

Proof. i) $\Psi_{\nu,\psi^*}[f]$ is a convex functional of $f$ because $\psi^*$ is convex. By conjugacy, $\psi(t) = \sup_y\{ty - \psi^*(y)\}$. Taking $t = 0$ and noting that $\psi(0) = 0$, we have $\psi^*(y) \ge 0$ for all $y$. This proves $\Psi_{\nu,\psi^*}[f] \ge 0$.

ii) If $\psi$ is strictly convex, then $\psi^*$ is also strictly convex on its domain. This implies that there is at most a single value $b_0$ such that $\psi^*(b_0) = 0$. Given that $\psi^*(y) \ge 0$ for all $y$, we conclude that $E_{x\sim\nu}[\psi^*(f(x))] = 0$ implies $\psi^*(f(x)) = 0$ almost surely under $x\sim\nu$, which then implies $f(x) = b_0$ almost surely.

In practice, it is impossible to numerically optimize over the class of all functions in (24). Instead, practical f-GANs restrict the optimization to a parametric set $\mathcal{F}$ of neural networks, yielding the following neural φ-divergence:
$$d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu) = \sup_{f\in\mathcal{F}} \big(E_\mu[f] - E_\nu[f] - \Psi_{\nu,\psi^*}[f]\big). \qquad (25)$$
This can be viewed as a generalization of the $\mathcal{F}$-related IPM $d_{\mathcal{F}}(\mu,\nu)$, which is recovered by taking $\psi^* \equiv 0$. However, the properties of the neural φ-divergence can differ significantly from those of $d_{\mathcal{F}}(\mu,\nu)$. For example, $d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu)$ is not even guaranteed to be non-negative for arbitrary discriminator sets $\mathcal{F}$, because of the negative regularization term. Fortunately, we can still establish the non-negativity and discriminative property of $d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu)$ under weak conditions on $\mathcal{F}$. Moreover, the property that $d_{\mathcal{F}}(\mu,\nu) = 0$ implies moment matching on $\mathcal{F}$, which is the key step in establishing discriminative power, is not necessarily true for the neural divergence. Fortunately, it turns out that $d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu) = 0$ implies moment matching on the features defined by the last linear layer of the discriminators.

Theorem B.1. Assume $\mathcal{F}$ includes the constant function $b_0\in\mathbb{R}$, where $b_0$ satisfies $\psi^*(b_0) = 0$ as defined in Lemma B.1. We have:

i) $0 \le d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu) \le d_{\mathcal{F}}(\mu,\nu)$ for all $\mu,\nu$. As a result, $d_{\mathcal{F}}(\mu,\nu) = 0$ implies $d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu) = 0$; in other words, moment matching on $\mathcal{F}$ is a sufficient condition for zero neural φ-divergence.

ii) Further, assume $\mathcal{F}$ has the following form:
$$\mathcal{F} \supseteq \{\sigma(\alpha f_0 + c_0):\ |\alpha| \le \alpha_{f_0},\ f_0\in\mathcal{F}_0\}, \qquad (26)$$
where $\mathcal{F}_0$ is any function set, $\alpha_{f_0} > 0$ is a positive number associated with each $f_0\in\mathcal{F}_0$, $c_0$ is a constant, and $\sigma:\mathbb{R}\to\mathbb{R}$ is any function that satisfies $\sigma(c_0) = b_0$ and $\sigma'(c_0) > 0$. Here $\sigma$ can be viewed as the output activation function of a deep neural network whose previous layers are specified by $\mathcal{F}_0$. Assume $\psi^*(y)$ is differentiable at $y = b_0$. Then $d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu) = 0$ implies $d_{\mathcal{F}_0}(\mu,\nu) = 0$; in other words, moment matching on $\mathcal{F}_0$ is a necessary condition for zero neural φ-divergence.

iii) $\mathrm{cl}(\mathrm{span}\,\mathcal{F}_0) \supseteq C_b(X)$ is a sufficient condition for $d_{\varphi,\mathcal{F}}$ to be discriminative, i.e., for $d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu) = 0$ to imply $\mu = \nu$.

Condition (26) describes a commonly used structure of $\mathcal{F}$ that is naturally satisfied by the f-GANs used in practice; in particular, the output activation function $\sigma$ ensures that the output of $\mathcal{F}$ respects the input domain of the convex conjugate $\psi^*$. For example, the vanilla GAN has $\psi^*(t) = -\log(1 - e^t) - t$ with input domain $(-\infty, 0)$, and the activation function is taken to be $\sigma(t) = -\log(1 + e^{-t})$. See Table 2 of Nowozin et al. (2016) for the list of output activation functions associated with commonly used $\psi$.
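As a quick numerical sanity check on this $\psi^*/\sigma$ pairing (an illustration, not code from the paper): substituting $\sigma(t) = -\log(1+e^{-t})$ and $\psi^*(t) = -\log(1-e^t) - t$ into the objective in (25) recovers the familiar discriminator objective $E_\mu[\log D] + E_\nu[\log(1-D)]$ with $D = \mathrm{sigmoid}(h)$, where $h$ is the pre-activation network output. The samples and the logit function $h$ below are arbitrary placeholders.

```python
# Verify numerically that the regularized objective in (25) with the vanilla-GAN
# pairing  sigma(t) = -log(1 + e^{-t}),  psi*(t) = -log(1 - e^t) - t
# equals the usual GAN discriminator objective E_mu[log D] + E_nu[log(1 - D)],
# where D = sigmoid(h) and h is an arbitrary "logit" function of the data.
import numpy as np

rng = np.random.default_rng(0)
x_mu = rng.normal(0.5, 1.0, size=2000)   # samples from mu (illustrative)
x_nu = rng.normal(0.0, 1.0, size=2000)   # samples from nu (illustrative)

h = lambda x: 0.7 * x - 0.2              # arbitrary logit function (placeholder)
sigma = lambda t: -np.log1p(np.exp(-t))  # output activation, maps R -> (-inf, 0)
psi_star = lambda t: -np.log(1.0 - np.exp(t)) - t

f_mu, f_nu = sigma(h(x_mu)), sigma(h(x_nu))
lhs = f_mu.mean() - f_nu.mean() - psi_star(f_nu).mean()     # objective in (25)

D = lambda x: 1.0 / (1.0 + np.exp(-h(x)))                   # D = sigmoid(h)
rhs = np.log(D(x_mu)).mean() + np.log(1.0 - D(x_nu)).mean() # vanilla GAN objective

print(lhs, rhs)   # the two values agree up to floating-point error
```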
Proof of Theorem B.1. i) Because $b_0\in\mathcal{F}$ and $\psi^*(b_0) = 0$, we have $d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu) \ge E_\mu[b_0] - E_\nu[b_0] - \Psi_{\nu,\psi^*}[b_0] = 0$. Because $\Psi_{\nu,\psi^*}[f] \ge 0$, we obtain $d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu) \le d_{\mathcal{F}}(\mu,\nu)$ by comparing (25) with $d_{\mathcal{F}}(\mu,\nu) = \sup_{f\in\mathcal{F}}\{E_\mu[f] - E_\nu[f]\}$.

ii) Note that $d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu) = 0$ implies $E_\mu[f] - E_\nu[f] \le \Psi_{\nu,\psi^*}[f]$ for all $f\in\mathcal{F}$. Therefore,
$$E_\mu[\sigma(\alpha f_0 + c_0)] - E_\nu[\sigma(\alpha f_0 + c_0)] \le \Psi_{\nu,\psi^*}[\sigma(\alpha f_0 + c_0)], \quad \forall f_0\in\mathcal{F}_0,\ |\alpha| \le \alpha_{f_0}.$$
For $\alpha > 0$, this implies
$$\frac{1}{\alpha}\Big(E_{x\sim\mu}[\sigma(\alpha f_0(x)+c_0)] - E_{x\sim\nu}[\sigma(\alpha f_0(x)+c_0)]\Big) \le \frac{1}{\alpha}\,E_{x\sim\nu}\big[\psi^*(\sigma(\alpha f_0(x)+c_0))\big]. \qquad (27)$$
By the differentiability assumptions,
$$\lim_{\alpha\to 0}\frac{\sigma(\alpha f_0(x)+c_0) - \sigma(c_0)}{\alpha} = \sigma'(c_0) f_0(x), \qquad \lim_{\alpha\to 0}\frac{\psi^*(\sigma(\alpha f_0(x)+c_0)) - \psi^*(\sigma(c_0))}{\alpha} = (\psi^*)'(b_0)\,\sigma'(c_0)\, f_0(x) = 0,$$
where we used the facts that $\psi^*(\sigma(c_0)) = \psi^*(b_0) = 0$ and $(\psi^*)'(b_0) = 0$, because $b_0$ is a differentiable minimum point of $\psi^*$. Taking the limit $\alpha\to 0$ on both sides of (27), we get
$$\sigma'(c_0)\big(E_{x\sim\mu}[f_0(x)] - E_{x\sim\nu}[f_0(x)]\big) \le 0, \quad \forall f_0\in\mathcal{F}_0.$$
Because $\sigma'(c_0) > 0$ by assumption, this implies $E_{x\sim\mu}[f_0(x)] \le E_{x\sim\nu}[f_0(x)]$. The same argument applied to $-f_0$ gives the reverse inequality, and thus we finally obtain $E_{x\sim\mu}[f_0(x)] = E_{x\sim\nu}[f_0(x)]$.

iii) Combining Theorem 2.2 and part ii), we directly get the result.

Remark B.1. Our results on neural φ-divergence can in general be extended to the more unified framework of Liu et al. (2017), in which divergences of the form $\max_f E_{(x,y)\sim\mu\otimes\nu}[f(x,y)]$ are studied. We choose to focus on φ-divergence because of its practical importance. Our Theorem B.1 i) can be viewed as a special case of Theorem 4 of Liu et al. (2017), and our Theorem B.1 ii) is related to their Theorem 5. However, Theorem 5 of Liu et al. (2017) requires a rather counter-intuitive condition, while our condition in Theorem B.1 ii) is clear and is satisfied by all the φ-divergences listed in Nowozin et al. (2016).

Similar to Theorem 2.5, under the conditions of Theorem B.1 we have analogous results for the neural φ-divergence.

Theorem B.2. Using the same notation as in Theorem B.1, assume that $(X, d_X)$ is a compact metric space and that $\mathrm{cl}(\mathrm{span}\,\mathcal{F}_0) \supseteq C_b(X)$. Then if $\lim_{n\to\infty} d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu_n) = 0$, $\nu_n$ converges to $\mu$ in distribution. Further, if there exists $C > 0$ such that $\mathcal{F} \subseteq \{f\in C(X): \|f\|_{Lip} \le C\}$, we have
$$\lim_{n\to\infty} d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu_n) = 0 \iff \nu_n \Rightarrow \mu.$$
Note that we assume $(X, d_X)$ to be a compact metric space here only for simplicity. A non-compact result is available, but its proof is messy and non-intuitive.

Proof. The first half is a direct application of Theorem B.1 and Theorem 10 of Liu et al. (2017). For the second half, we have $d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu_n) \le d_{\mathcal{F}}(\mu,\nu_n) \le C\, d_W(\mu,\nu_n)$, where we use Theorem B.1 i) in the first inequality and the Lipschitz condition on $\mathcal{F}$ in the second inequality. Since $d_W$ metrizes weak convergence for compact $X$, we obtain $d_W(\mu,\nu_n)\to 0$ and thus $d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu_n)\to 0$.

B.2 GENERALIZATION PROPERTIES OF NEURAL φ-DIVERGENCE

Similar to the case of the neural distance, we can establish generalization bounds for the neural φ-divergence.

Theorem B.3. Assume that $\|f\|_\infty \le \Delta$ for any $f\in\mathcal{F}$, that $\hat\mu_m$ is an empirical distribution with $m$ samples from $\mu$, and that $\nu_m\in\mathcal{G}$ satisfies $d_{\varphi,\mathcal{F}}(\hat\mu_m\,\|\,\nu_m) \le \inf_{\nu\in\mathcal{G}} d_{\varphi,\mathcal{F}}(\hat\mu_m\,\|\,\nu) + \epsilon$. Then with probability at least $1 - 2\delta$, we have
$$d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu_m) \le \inf_{\nu\in\mathcal{G}} d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu) + 2R_m^{(\mu)}(\mathcal{F}) + 2\Delta\sqrt{\frac{2\log(1/\delta)}{m}} + \epsilon, \qquad (28)$$
where $R_m^{(\mu)}(\mathcal{F})$ is the Rademacher complexity of $\mathcal{F}$.

Notice that the only difference between Theorem 3.1 and Theorem B.3 is that the failure probability changes from $\delta$ to $2\delta$. This comes from the fact that $\mathcal{F}$ is typically not even (i.e., not closed under negation) in the neural divergence case. For example, the vanilla GAN takes $\sigma(t) = -\log(1 + e^{-t})$ as the output activation function, and thus $f \le 0$ for all $f\in\mathcal{F}$.
Proof of Theorem B.3. By the same argument as in Equation (10), we obtain
$$d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu_m) - \inf_{\nu\in\mathcal{G}} d_{\varphi,\mathcal{F}}(\mu\,\|\,\nu) \le 2\sup_{f\in\mathcal{F}} |E_\mu[f] - E_{\hat\mu_m}[f]| + \epsilon.$$
Although $\mathcal{F}$ is not even, we have
$$\sup_{f\in\mathcal{F}} |E_\mu[f] - E_{\hat\mu_m}[f]| = \max\Big\{\sup_{f\in\mathcal{F}}\big(E_\mu[f] - E_{\hat\mu_m}[f]\big),\ \sup_{f\in\mathcal{F}}\big(E_{\hat\mu_m}[f] - E_\mu[f]\big)\Big\}.$$
The standard argument based on Rademacher complexity (the same as in the proof of Theorem 3.1) gives that, with probability at least $1-\delta$,
$$\sup_{f\in\mathcal{F}}\big(E_\mu[f] - E_{\hat\mu_m}[f]\big) \le R_m^{(\mu)}(\mathcal{F}) + \Delta\sqrt{\frac{2\log(1/\delta)}{m}}.$$
With the same argument, we obtain that, with probability at least $1-\delta$,
$$\sup_{f\in\mathcal{F}}\big(E_{\hat\mu_m}[f] - E_\mu[f]\big) \le R_m^{(\mu)}(\mathcal{F}) + \Delta\sqrt{\frac{2\log(1/\delta)}{m}}.$$
Combining all the results above with a union bound, we conclude the proof.

With Theorem B.3, we obtain generalization bounds for different choices of $\mathcal{F}$, as we did in Section 3. For example, we have the following analog of Corollary 3.3 in the neural divergence setting.

Corollary B.4. Under the conditions of Theorem B.3, further assume that (1) $\mathcal{F} = \mathcal{F}_{nn} = \{f_\theta: \theta\in\Theta\subseteq[-1,1]^p\}$ is a parametric function class with $p$ parameters in a bounded set $\Theta$, and that (2) every $f_\theta$ is $L$-Lipschitz continuous with respect to the parameters $\theta$. Then with probability at least $1-2\delta$, we have
$$d_{\varphi,\mathcal{F}_{nn}}(\mu\,\|\,\nu_m) \le \inf_{\nu\in\mathcal{G}} d_{\varphi,\mathcal{F}_{nn}}(\mu\,\|\,\nu) + \frac{C}{\sqrt{m}} + \epsilon, \qquad (29)$$
where $C = 16\sqrt{2\pi}\,pL + 2\Delta\sqrt{2\log(1/\delta)}$.

C PROOF OF RESULTS IN SECTION 2

Proof of Theorem 2.2. For the sufficient part, the proof is standard and the same as that of the uniqueness of weak convergence; we refer to Lemma 9.3.2 in Dudley (2002) for a complete proof. For the necessary part, suppose that $\mathcal{F}\subseteq C_b(X)$ is discriminative in $\mathcal{P}_B(X)$, and assume, for the sake of contradiction, that $\mathrm{cl}(\mathrm{span}(\mathcal{F}\cup\{1\}))$ is a strict closed subspace of $C_b(X)$. Take $g\in C_b(X)\setminus\mathrm{cl}(\mathrm{span}(\mathcal{F}\cup\{1\}))$ with $\|g\|_\infty = 1$. By the Hahn-Banach theorem, there exists a bounded linear functional $L: C_b(X)\to\mathbb{R}$ such that $L(f) = 0$ for any $f\in\mathrm{cl}(\mathrm{span}(\mathcal{F}\cup\{1\}))$ and $L \ne 0$. Thanks to the Riesz representation theorem for compact metric spaces, there exists a signed, regular Borel measure $m\in\mathcal{M}_B(X)$ such that
$$L(f) = \int_X f\,dm, \qquad \forall f\in C_b(X).$$
Let $m = \mu - \nu$ be the Hahn-Jordan decomposition of $m$, where $\mu$ and $\nu$ are two nonnegative Borel measures. Then we have $L(f) = \int f\,d\mu - \int f\,d\nu$ for any $f\in C_b(X)$. Thanks to $L(1) = 0$, we have $0 < \mu(X) = \nu(X) < \infty$. We can assume that $\mu$ and $\nu$ are Borel probability measures (otherwise, we can use the normalized nonzero linear functional $L/\mu(X)$, whose Hahn-Jordan decomposition consists of two Borel probability measures). Since $L(f) = 0$ for any $f\in\mathrm{cl}(\mathrm{span}(\mathcal{F}))$, we have $\int f\,d\mu = \int f\,d\nu$ for any $f\in\mathcal{F}$. Since $\mathcal{F}\subseteq C_b(X)$ is discriminative, we have $\mu = \nu$ and thus $L = 0$, which is a contradiction.

Proof of Corollary 2.4. Thanks to $\{\lambda\theta: \lambda\ge 0,\ \theta\in\Theta\} = \mathbb{R}^{d+1}$, for any $[w, b]\in\mathbb{R}^{d+1}$ there exist $[w_0, b_0]\in\Theta$ and $\lambda > 0$ such that
$$\sigma(w^\top x + b) = \sigma\big(\lambda(w_0^\top x + b_0)\big) = \lambda^\alpha\,\sigma(w_0^\top x + b_0),$$
where we used $\sigma(u) = \max\{u, 0\}^\alpha$ in the last step. Therefore, we have $\mathrm{span}\,\mathcal{F}_{nn} \supseteq \mathrm{span}\{\sigma(w^\top x + b): [w, b]\in\mathbb{R}^{d+1}\}$. Thanks to Theorem 2.3, we know that $\mathrm{span}\,\mathcal{F}_{nn}$ is dense in $C_b(X)$.

Proof of Theorem 2.5. Given a function $g\in C_b(X)$, we say that $g$ is approximated by $\mathcal{F}$ with error decay function $\epsilon(r)$ if for any $r\ge 0$ there exists $f_r\in\mathrm{span}\,\mathcal{F}$ with $\|f_r\|_{\mathcal{F},1}\le r$ such that $\|g - f_r\|_\infty \le \epsilon(r)$. Obviously, $\epsilon(r)$ is a non-increasing function of $r$. Thanks to $\mathrm{cl}(\mathrm{span}\,\mathcal{F}) \supseteq C_b(X)$, we have $\lim_{r\to\infty}\epsilon(r) = 0$. Now denote $r_n := d_{\mathcal{F}}(\mu,\nu_n)^{-1/2}$ and correspondingly $f_n := f_{r_n}$. We have
$$|E_\mu g - E_{\nu_n} g| \le |E_\mu g - E_\mu f_n| + |E_{\nu_n} g - E_{\nu_n} f_n| + |E_\mu f_n - E_{\nu_n} f_n| \le 2\epsilon(r_n) + r_n\, d_{\mathcal{F}}(\mu,\nu_n) = 2\epsilon(r_n) + 1/r_n.$$
If $\lim_{n\to\infty} d_{\mathcal{F}}(\mu,\nu_n) = 0$, we have $\lim_{n\to\infty} r_n = \infty$. Thanks to $\lim_{r\to\infty}\epsilon(r) = 0$, we conclude that $\lim_{n\to\infty}|E_\mu g - E_{\nu_n} g| = 0$. Since this holds for any $g\in C_b(X)$, we conclude that $\nu_n$ weakly converges to $\mu$. If $\mathcal{F}\subseteq BL_C(X)$ for some $C > 0$, we have $d_{\mathcal{F}}(\mu,\nu) \le C\,d_{BL}(\mu,\nu)$ for any $\mu,\nu$. Because the bounded Lipschitz distance (also called the Fortet-Mourier distance) metrizes weak convergence, we obtain that $\nu_n\Rightarrow\mu$ implies $d_{BL}(\mu,\nu_n)\to 0$, and thus $d_{\mathcal{F}}(\mu,\nu_n)\to 0$.
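To connect these density results to something computable, the sketch below estimates the neural distance $d_{\mathcal{F}}(\hat\mu,\hat\nu)$ between two sample sets when $\mathcal{F}$ is the single-ReLU-neuron class of Corollary 2.4, $\{\max(w^\top x + b,\,0): \|[w; b]\|_2 = 1\}$, via gradient ascent with random restarts. The sample distributions, restart count, and optimizer settings are arbitrary illustrative choices, not anything from the paper.

```python
# Sketch: estimate the neural distance d_F(mu_hat, nu_hat) for the single-neuron
# discriminator class F = { relu(w.x + b) : ||[w, b]||_2 = 1 }  (cf. Corollary 2.4).
# Sample sets, restarts, and optimizer settings are arbitrary illustrative choices.
import torch

torch.manual_seed(0)
x_mu = torch.randn(4000, 2) + torch.tensor([0.5, 0.0])   # samples from mu
x_nu = torch.randn(4000, 2)                               # samples from nu

def neuron_gap(theta, x_mu, x_nu):
    wb = theta / theta.norm()            # project parameters onto the unit sphere
    w, b = wb[:2], wb[2]
    return torch.relu(x_mu @ w + b).mean() - torch.relu(x_nu @ w + b).mean()

best = 0.0
for restart in range(10):                # multi-start to avoid poor local maxima
    theta = torch.randn(3, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=0.05)
    for step in range(300):
        gap = neuron_gap(theta, x_mu, x_nu)
        (-gap).backward()                # ascend on E_mu[f] - E_nu[f]
        opt.step()
        opt.zero_grad()
    best = max(best, abs(float(neuron_gap(theta, x_mu, x_nu))))

print("estimated d_F(mu_hat, nu_hat) over single neurons:", best)
```

Even this one-neuron class separates the two sample sets here, in line with the density argument above; the sketch of course only approximates the supremum from below.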
Proof of Proposition 2.6. Let $g = \sum_{i=1}^n w_i f_i + w_0$. Then we have
$$|E_\mu g - E_\nu g| \le \sum_{i=1}^n |w_i|\,|E_\mu f_i - E_\nu f_i| \le \Big(\sum_{i=1}^n |w_i|\Big)\, d_{\mathcal{F}}(\mu,\nu).$$
The result is obtained by taking the infimum over all such representations of $g$.

Proof of Proposition 2.7. For any $r\ge 0$, we have
$$|E_\mu g - E_\nu g| \le |E_\mu g - E_\mu f_r| + |E_\nu g - E_\nu f_r| + |E_\mu f_r - E_\nu f_r| \le 2\epsilon(r) + r\,d_{\mathcal{F}}(\mu,\nu).$$
Taking the infimum over $r > 0$ on the right-hand side gives the result.

Proof of Corollary 2.8. Proposition 5 of Bach (2017) shows that for any bounded Lipschitz function $g$ satisfying $\|g\|_{BL} := \max\{\|g\|_\infty, \|g\|_{Lip}\} \le \eta$, we have $\epsilon(r) = O\big(\eta\,(r/\eta)^{-1/(\alpha+(d-1)/2)}\log(r/\eta)\big)$. Using Proposition 2.7, we get
$$|E_\mu g - E_\nu g| \le O\Big(\|g\|_{BL}\; d_{\mathcal{F}}(\mu,\nu)^{\frac{1}{\alpha+(d+1)/2}}\Big).$$
The result follows from $d_{BL}(\mu,\nu) = \sup_g\{|E_\mu g - E_\nu g|: \|g\|_{BL}\le 1\}$.

D PROOF OF RESULTS IN SECTION 3

Proof of Equation (10). Using the standard derivation and the optimality condition (9), we have
$$d_{\mathcal{F}}(\mu,\nu_m) - \inf_{\nu\in\mathcal{G}} d_{\mathcal{F}}(\mu,\nu) = d_{\mathcal{F}}(\mu,\nu_m) - d_{\mathcal{F}}(\hat\mu_m,\nu_m) + d_{\mathcal{F}}(\hat\mu_m,\nu_m) - \inf_{\nu\in\mathcal{G}} d_{\mathcal{F}}(\mu,\nu) \le d_{\mathcal{F}}(\mu,\nu_m) - d_{\mathcal{F}}(\hat\mu_m,\nu_m) + \inf_{\nu\in\mathcal{G}} d_{\mathcal{F}}(\hat\mu_m,\nu) - \inf_{\nu\in\mathcal{G}} d_{\mathcal{F}}(\mu,\nu) + \epsilon.$$
Therefore, we obtain
$$d_{\mathcal{F}}(\mu,\nu_m) - \inf_{\nu\in\mathcal{G}} d_{\mathcal{F}}(\mu,\nu) \le 2\sup_{\nu\in\mathcal{G}} |d_{\mathcal{F}}(\mu,\nu) - d_{\mathcal{F}}(\hat\mu_m,\nu)| + \epsilon.$$
Combining this with the definition (1), we obtain
$$d_{\mathcal{F}}(\mu,\nu_m) - \inf_{\nu\in\mathcal{G}} d_{\mathcal{F}}(\mu,\nu) \le 2\sup_{f\in\mathcal{F}} |E_\mu[f] - E_{\hat\mu_m}[f]| + \epsilon.$$

Proof of Theorem 3.1. First of all, since $\mathcal{F}$ is even, we have $\sup_{f\in\mathcal{F}}|E_\mu[f] - E_{\hat\mu_m}[f]| = \sup_{f\in\mathcal{F}}(E_\mu[f] - E_{\hat\mu_m}[f])$. Consider the function
$$h(X_1, X_2, \ldots, X_m) = \sup_{f\in\mathcal{F}}\big(E_\mu[f] - E_{\hat\mu_m}[f]\big).$$
Since $f$ takes values in $[-\Delta, \Delta]$, changing $X_i$ to an independent copy $X_i'$ changes $h$ by no more than $2\Delta/m$. McDiarmid's inequality implies that with probability at least $1-\delta$,
$$\sup_{f\in\mathcal{F}}\big(E_\mu[f] - E_{\hat\mu_m}[f]\big) \le E\Big[\sup_{f\in\mathcal{F}}\big(E_\mu[f] - E_{\hat\mu_m}[f]\big)\Big] + \Delta\sqrt{\frac{2\log(1/\delta)}{m}}.$$
A standard symmetrization argument on the Rademacher complexity gives
$$E\Big[\sup_{f\in\mathcal{F}}\big(E_\mu[f] - E_{\hat\mu_m}[f]\big)\Big] \le R_m^{(\mu)}(\mathcal{F}).$$
Combining the two estimates above with Eqn. (10), we conclude the proof.

Proof of Corollary 3.2. Part of the proof is adapted from Proposition 7 in Bach (2017). More precisely, the discriminator set we use here is
$$\mathcal{F} = \Big\{\sum_{i=1}^n w_i\max(v_i^\top[x;1],\,0):\ \sum_{i=1}^n|w_i|\le 1,\ \|v_i\|_2 = 1,\ 1\le i\le n\Big\}$$
for a fixed $n\in\mathbb{N}$. Since $\|x\|_2\le 1$ and $\|v_i\|_2\le 1$, it is easy to see that $\|f\|_\infty\le\sqrt2$ for all $f\in\mathcal{F}$. We want to estimate $R^{(\mu)}_m(\mathcal{F})$ and then use Theorem 3.1 to prove the result. First, it is easy to verify that
$$\sup_{f\in\mathcal{F}}\frac1m\sum_i\tau_i f(X_i) = \sup_{\|v\|_2=1}\Big|\frac1m\sum_i\tau_i\max(v^\top[X_i;1],\,0)\Big|.$$
Then we have
$$R^{(\mu)}_m(\mathcal{F}) = E\,\sup_{\|v\|_2=1}\Big|\frac1m\sum_i\tau_i\max(v^\top[X_i;1],\,0)\Big| \le 2\,E\,\sup_{\|v\|_2=1}\Big|\frac1m\sum_i\tau_i\,v^\top[X_i;1]\Big| \le \frac2m\,E\Big\|\sum_i\tau_i[X_i;1]\Big\|_2,$$
where we use the 1-Lipschitz property of $\max(\cdot,0)$ and Talagrand's contraction lemma (Ledoux & Talagrand, 2013) in the first inequality. From Kakade et al. (2009), we get for the Rademacher complexity of linear functions
$$\frac1m\,E\Big\|\sum_i\tau_i[X_i;1]\Big\|_2 \le \frac{\max_i\|[X_i;1]\|_2}{\sqrt m}\le\sqrt{\frac2m}.$$
Therefore, we obtain $R^{(\mu)}_m(\mathcal{F})\le 2\sqrt{2/m}$. Combined with $\|f\|_\infty\le\sqrt2$ and Theorem 3.1, we finish the proof.
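As a numerical companion to this proof (an illustration with arbitrary data and search budget, not the paper's code), one can estimate the empirical Rademacher complexity of the same single-hidden-layer class directly, using the reduction above of the supremum over $\mathcal{F}$ to a supremum over a single unit-norm direction $v$, and compare the estimate with the $2\sqrt{2/m}$ bound just derived.

```python
# Monte Carlo estimate of the empirical Rademacher complexity of the class in
# Corollary 3.2, using  sup_{f in F} (1/m) sum_i tau_i f(X_i)
#                      = sup_{||v||=1} | (1/m) sum_i tau_i relu(v.[X_i;1]) |.
# Data, the number of Rademacher draws, and the random-search size are illustrative;
# random search over directions only lower-bounds the supremum.
import numpy as np

rng = np.random.default_rng(0)
m, d = 500, 2
X = rng.uniform(-1, 1, size=(m, d)) / np.sqrt(d)         # ensures ||x||_2 <= 1
X1 = np.hstack([X, np.ones((m, 1))])                      # augmented points [x; 1]

def sup_over_v(tau, n_dirs=5000):
    V = rng.normal(size=(n_dirs, d + 1))
    V /= np.linalg.norm(V, axis=1, keepdims=True)         # unit-norm directions v
    vals = np.abs(np.maximum(X1 @ V.T, 0.0).T @ tau) / m  # |(1/m) sum_i tau_i relu(v.[x_i;1])|
    return vals.max()

estimates = [sup_over_v(rng.choice([-1.0, 1.0], size=m)) for _ in range(50)]
print("empirical Rademacher complexity estimate:", np.mean(estimates))
print("comparison value 2*sqrt(2/m):", 2 * np.sqrt(2 / m))
```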
Proof of Corollary 3.3. We need to derive a bound for the Rademacher complexity (11). For fixed $\{x_i\}_{i=1}^m$, consider $X_\theta = \frac{1}{L\sqrt m}\sum_{i=1}^m \tau_i f_\theta(x_i)$. First, $\{X_\theta:\theta\in\Theta\}$ is a sub-Gaussian process with respect to the Euclidean distance on $\Theta$, i.e., for all $\theta,\theta'\in\Theta$ and all $\lambda>0$,
$$E\big[\exp\big(\lambda(X_\theta - X_{\theta'})\big)\big] \le \exp\Big(\frac{\lambda^2\|\theta-\theta'\|_2^2}{2}\Big).$$
This is a standard result and can be derived from Hoeffding's lemma. Second, we have the following bound on the $\epsilon$-covering number of $\Theta$:
$$\log N(\epsilon,\Theta,\|\cdot\|_2)\le p\log(\sqrt p/\epsilon). \qquad (30)$$
This bound follows from a simple construction: consider a uniform grid with grid size $2\epsilon/\sqrt p$. The balls centered at the grid points with radius $\epsilon$ cover the unit cube $[-1,1]^p$ in $\mathbb{R}^p$, and the number of balls in this construction is $(\sqrt p/\epsilon)^p$. Finally, by applying Dudley's entropy integral, we have
$$E\sup_{\theta\in\Theta}X_\theta \le 8\sqrt2\int_0^\infty\sqrt{\log N(\epsilon,\Theta,\|\cdot\|_2)}\,d\epsilon. \qquad (31)$$
Combining (30) and (31) and substituting $\tau=\log(\sqrt p/\epsilon)$, we obtain
$$E\sup_{\theta\in\Theta}X_\theta \le 8\sqrt2\,p\int_0^\infty\tau^{1/2}e^{-\tau}\,d\tau = 4\sqrt{2\pi}\,p. \qquad (32)$$
Notice that Eqn. (32) holds for arbitrary samples $\{x_i\}_{i=1}^m$, and thus we conclude that $R^{(\mu)}_m(\mathcal{F})\le C/\sqrt m$, where $C = 8\sqrt{2\pi}\,pL$.

Proof of Corollary 3.4. Lemma 22 in Bartlett & Mendelson (2003) shows that if $\sup_{x\in X}k(x,x)\le C_k < +\infty$, then $R^{(\mu)}_m(\mathcal{F})\le 2\sqrt{C_k/m}$ for any $\mu\in\mathcal{P}_B(X)$. Also note that $|f(x)| \le \|f\|_{\mathcal{H}}\sqrt{k(x,x)} \le \|f\|_{\mathcal{H}}\sqrt{C_k}$. Combined with Theorem 3.1, we conclude the proof.

Proof of Corollary 3.5. Use Proposition 3.1 and note that $KL(\mu\,\|\,\nu_m) \le \Lambda_{\mathcal{F},\mathcal{G}}\, d_{\mathcal{F}}(\mu,\nu_m)$ and $d_{\mathcal{F}}(\mu,\nu) \le TV(\mu,\nu) \le \sqrt{2\,KL(\mu,\nu)}$ by Pinsker's inequality.

E INCONSISTENCY BETWEEN GAN'S LOSS AND TESTING LIKELIHOOD

In this section, we test our analysis of the consistency between the GAN objective and the likelihood objective on two toy datasets: a 2D Gaussian dataset and a 2D 8-Gaussian mixture dataset.

E.1 A 2D GAUSSIAN EXAMPLE

The underlying ground-truth distribution is a 2D Gaussian with mean $(0.5, 0.5)$ and covariance matrix $\frac{1}{128}\begin{pmatrix}17 & 15\\ 15 & 17\end{pmatrix}$. We take $10^5$ samples for training and 1000 samples for testing. For a 2D Gaussian distribution, we use the affine generator in Eqn. (33), $x = Az + b$, where $z = (z_1, z_2)^\top$ is a standard 2D normal random vector, the matrix $A$ is parameterized by $l\in\mathbb{R}$ and $s = (s_1, s_2)\in\mathbb{R}^2$, and $b = (b_1, b_2)\in\mathbb{R}^2$ is a bias; $l$, $s$, and $b$ are the trainable parameters of the generator.

We train the generative model by WGAN with weight clipping. In the first experiment, the discriminator set is a neural network with one hidden layer and 500 hidden neurons, i.e.,
$$\mathcal{F}_{nn} = \Big\{\sum_{i=1}^{500}\alpha_i\max(w_i^\top[x;1],\,0):\ -0.05\le\alpha_i\le 0.05,\ -0.05\le w_i\le 0.05\ \ \forall i\Big\}.$$
Motivated by Corollary 3.5, in the second experiment we take the discriminators to be log-density ratios between two Gaussian distributions, which are quadratic polynomials:
$$\mathcal{F}_{quad} = \{x^\top A x + b^\top x:\ -0.05\le A\le 0.05,\ -0.05\le b\le 0.05\}.$$
We plot the results in Figure 1. Both discriminator sets behave well: the training loss (the neural distance) converges to zero, and the testing log-likelihood increases monotonically during training. However, the quadratic polynomial discriminators $\mathcal{F}_{quad}$ yield a higher testing log-likelihood and a better generative model at convergence. This is expected, because Corollary 3.5 guarantees that the testing log-likelihood is bounded by the GAN loss (up to a constant) for $\mathcal{F}_{quad}$, while this is not true for $\mathcal{F}_{nn}$.

We can also train the model by maximizing the likelihood (MLE) on the training dataset; the result is shown in Figure 3. MLE and Q-GAN (WGAN with the quadratic discriminator set $\mathcal{F}_{quad}$) yield similar results, but directly maximizing the likelihood converges much faster than WGAN in this example. In this simple Gaussian example, the WGAN loss and the testing log-likelihood are consistent. We indeed observe that, by carefully choosing the discriminator set (as suggested by Corollary 3.5), the testing log-likelihood can be optimized simultaneously as we optimize the GAN objective.

Figure 1: Negative GAN losses and testing likelihood. qgan: WGAN with quadratic polynomials as discriminators; wgan: WGAN with a neural discriminator.
Figure 2: Samples from the trained generators. Left: WGAN with quadratic polynomials as discriminators. Right: WGAN with a neural discriminator with 500 neurons.
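For concreteness, here is a minimal sketch of the Q-GAN setup of this subsection: a WGAN objective with weight clipping at 0.05 and a quadratic discriminator from $\mathcal{F}_{quad}$, trained against an affine generator standing in for Eqn. (33). The optimizer, batch size, and number of updates are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the E.1 setup: WGAN with weight clipping and the quadratic
# discriminator class F_quad = { x'Ax + b'x }.  The affine generator, optimizer,
# and training-loop settings are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
mean = torch.tensor([0.5, 0.5])
cov = torch.tensor([[17.0, 15.0], [15.0, 17.0]]) / 128.0
chol = torch.linalg.cholesky(cov)
real = torch.randn(100000, 2) @ chol.T + mean            # training samples from mu

class QuadDisc(nn.Module):                               # f(x) = x'Ax + b'x
    def __init__(self):
        super().__init__()
        self.A = nn.Parameter(torch.zeros(2, 2))
        self.b = nn.Parameter(torch.zeros(2))
    def forward(self, x):
        return ((x @ self.A) * x).sum(-1) + x @ self.b

gen = nn.Linear(2, 2)                                    # affine generator x = Wz + b
disc = QuadDisc()
g_opt = torch.optim.RMSprop(gen.parameters(), lr=5e-3)
d_opt = torch.optim.RMSprop(disc.parameters(), lr=5e-3)

for step in range(2000):
    x_real = real[torch.randint(0, real.shape[0], (256,))]
    for _ in range(5):                                   # discriminator (critic) updates
        x_fake = gen(torch.randn(256, 2))
        d_loss = disc(x_fake).mean() - disc(x_real).mean()
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()
        for p in disc.parameters():                      # weight clipping at 0.05
            p.data.clamp_(-0.05, 0.05)
    x_fake = gen(torch.randn(256, 2))
    g_loss = -disc(x_fake).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

For the first experiment, one would replace QuadDisc with a one-hidden-layer ReLU network with 500 units and clip its weights to $[-0.05, 0.05]$ in the same way.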
E.2 AN EXAMPLE OF A 2D 8-GAUSSIAN MIXTURE

The underlying ground-truth distribution is a 2D Gaussian mixture with 8 equally weighted components. Their centers are distributed uniformly on the circle of radius 2 centered at the origin, and their standard deviations are all 0.01414. We take $10^5$ samples as the training dataset and 1000 samples as the testing dataset. One batch (256 samples) of the training dataset and the testing dataset are shown in Figure 4. Note that the density of the ground-truth distribution is highly singular.

We still use Eqn. (33) as the generator for a single Gaussian component. Our generator assumes that there are 8 Gaussian components with equal weights, so the generator has no modeling error. The trainable parameters are eight sets of scaling and biasing parameters in Eqn. (33), one for each Gaussian component.

We first train the model by WGAN with weight clipping, using an MLP with 4 hidden layers and ReLU activations as the discriminator set. The result is shown in Figure 5. The generator's samples are nearly indistinguishable from the real samples. However, the GAN loss and the log-likelihood are not consistent. In the initial stage of training, both the negative GAN loss and the log-likelihood increase. As training goes on, the generator's density becomes more and more singular, and the log-likelihood behaves erratically in the later stage of training: although the negative GAN loss keeps increasing, the log-likelihood oscillates wildly, and in fact over half of the time it is $-\infty$. We show the generated samples at intermediate steps in Figure 6, and we indeed see that the likelihood starts to oscillate violently when the generator's distribution becomes singular. This inconsistency between the GAN loss and the likelihood has been observed in other works as well. The reason for the inconsistency is that the neural discriminators are not a good approximation of the singular density ratios.

We also train the model by maximizing the likelihood on the training dataset; the result is shown in Figure 7. The maximum likelihood training gets stuck in a local minimum and fails to exactly recover all 8 components. The log-likelihoods on the training and testing datasets are consistent, as expected. Although the log-likelihood (about 2.7) obtained by maximizing the likelihood is higher than that (about 2.0) obtained by WGAN training, its generator is clearly worse than the one obtained by WGAN training. The reason is that the negative log-likelihood loss has many local minima, and maximizing the likelihood easily gets trapped in one of them.

Flow GAN (Grover et al., 2017) proposes to combine the WGAN loss and the log-likelihood to resolve this inconsistency. We show the Flow GAN result on this dataset in Figure 8. Training with Flow GAN indeed makes the training loss and the log-likelihood consistent; however, Flow GAN gets stuck in a local minimum just as maximizing the likelihood does, which is not desirable.

Figure 3: Left: samples from the maximum likelihood estimate. Right: log-likelihood on the training and testing datasets, trained with SGD.
Figure 4: Samples from the training and testing datasets.
Figure 5: Left: samples from the training dataset (yellow) and samples from the generator (green). Right: negative GAN loss and log-likelihood (evaluated on the testing dataset).
Figure 6: Left to right: generated samples at steps 100, 200, and 300, respectively.
Figure 7: Left: samples from the training dataset and samples from the generator. Right: log-likelihood on the training and testing datasets.
Figure 8: Left: samples from the training dataset and samples from the generator. Right: negative Flow GAN loss and log-likelihood on the testing dataset.
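For reference, the 8-Gaussian mixture data used in Section E.2 can be generated as follows: a small sketch under the stated parameters (8 equally weighted components on the circle of radius 2, standard deviation 0.01414, $10^5$ training and 1000 testing samples); the random seed and array layout are arbitrary.

```python
# Generate the 2D 8-Gaussian mixture data described in Section E.2:
# 8 equally weighted components, centers on the circle of radius 2,
# standard deviation 0.01414, 10^5 training and 10^3 test samples.
import numpy as np

def sample_mixture(n, rng):
    angles = 2 * np.pi * np.arange(8) / 8
    centers = 2.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (8, 2)
    comp = rng.integers(0, 8, size=n)                                   # equal weights
    return centers[comp] + 0.01414 * rng.normal(size=(n, 2))

rng = np.random.default_rng(0)
train, test = sample_mixture(100000, rng), sample_mixture(1000, rng)
print(train.shape, test.shape)
```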