# χ² Generative Adversarial Network

Chenyang Tao¹, Liqun Chen¹, Ricardo Henao¹, Jianfeng Feng², Lawrence Carin¹

¹Electrical & Computer Engineering, Duke University, Durham, NC 27708, USA. ²ISTBI, Fudan University, Shanghai, China. Correspondence to: Chenyang Tao.

*Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).*

**Abstract.** To assess the difference between real and synthetic data, Generative Adversarial Networks (GANs) are trained using a distribution discrepancy measure. Three widely employed measures are information-theoretic divergences, integral probability metrics, and Hilbert space discrepancy metrics. We elucidate the theoretical connections between these three popular GAN training criteria and propose a novel procedure, called χ²-GAN, that is conceptually simple, stable at training and resistant to mode collapse. Our procedure naturally generalizes to address the problem of simultaneous matching of multiple distributions. Further, we propose a resampling strategy that significantly improves sample quality, by repurposing the trained critic function via an importance weighting mechanism. Experiments show that the proposed procedure improves stability and convergence, and yields state-of-the-art results on a wide range of generative modeling tasks.

## 1. Introduction

Learning to sample from complicated distributions has attracted considerable recent interest, with many important applications (Zhu et al., 2017; Ledig et al., 2017; Yu et al., 2017; Hu et al., 2017). Likelihood-free models avoid the need to explicitly assume a particular parametrization of the data-generating distribution $p_G(x)$. Such models implicitly define a distribution via a generator $G(z; \theta): \mathcal{Z} \to \mathcal{X}$ and a latent random variable $Z$ with pre-specified distribution $q(z)$. Samples from the generator are produced by first drawing $Z \sim q(z)$ and then feeding it through the generator.

To match $p_G(x)$ to the true data distribution $p_d(x)$, one estimates a discrepancy measure $d(p_d, p_G)$. In the GAN framework, the discrepancy is first estimated by maximizing an auxiliary variational functional $V(p_d, p_G; D): \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ between distributions $p_d(x)$ and $p_G(x)$, satisfying $d(p_d, p_G) = \max_D V(p_d, p_G; D)$, where $\mathcal{P}$ is the space of probability distributions and $V(p_d, p_G; D)$ is estimated using samples from the two distributions. The function $D(x; \omega)$, parameterized by $\omega$ and known as the critic function, is intended to maximally discriminate between samples of the two distributions. One seeks to match the generator distribution $p_G(x)$ to the unknown true distribution $p_d(x)$ by solving a minimax game between the critic and the generator: $\min_G \max_D V(p_d, p_G; D)$.

Following ideas from the original GAN (Goodfellow et al., 2014), which optimizes the Jensen-Shannon divergence, much recent work has focused on information-theoretic divergences, such as the KL-divergence (Sønderby et al., 2017). Many other studies have investigated the generalized f-divergence (Csiszár, 1963),
$$\mathrm{Div}_f(p_d \,\|\, p_G) \triangleq \int f\!\left(\frac{p_d(x)}{p_G(x)}\right) p_G(x)\, dx,$$
where $f(\cdot): \mathbb{R} \to \mathbb{R}$ is a convex function satisfying $f(1) = 0$, which summarizes the local discrepancy between $p_d(x)$ and $p_G(x)$. Nowozin et al. (2016) proposed an algorithm based on the variational formulation of $\mathrm{Div}_f(p_d \,\|\, p_G)$, Uehara et al. (2016) explored in depth its density-ratio formulation, and Nock et al. (2017) further generalized it from an information-geometric perspective.
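As a concrete numerical illustration (not taken from the paper), an f-divergence can be estimated by Monte Carlo whenever the density ratio is available. The minimal NumPy sketch below uses the Pearson χ² member of the family, $f(u) = (u-1)^2$, for two Gaussians with known densities; the distributions and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two toy densities standing in for p_d and p_G (both known in closed form here).
p_d = lambda x: normal_pdf(x, 0.0, 1.0)
p_G = lambda x: normal_pdf(x, 0.5, 1.0)

# f for the Pearson chi-square divergence, one member of the f-divergence family.
f = lambda u: (u - 1.0) ** 2

# Monte Carlo estimate of Div_f(p_d || p_G) = E_{x ~ p_G}[ f(p_d(x) / p_G(x)) ].
x = rng.normal(0.5, 1.0, size=200_000)          # samples from p_G
div_f = np.mean(f(p_d(x) / p_G(x)))

# Closed form for two unit-variance Gaussians: chi2 = exp((mu1 - mu2)^2) - 1.
print(div_f, np.exp(0.25) - 1.0)
```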
Interestingly, Mao et al. (2017) showed that a specific type of f-divergence, namely the χ²-divergence, can be directly optimized for GAN learning, by recasting it as a least-squares regression problem. However, Arjovsky & Bottou (2017) showed that when using divergence-based objectives, the parameter updates for the generator can be either uninformative or numerically unstable, and divergence-based objectives may not be continuous wrt the generator parameters.

These issues motivated development of GAN formulations based on Integral Probability Metrics (IPMs) (Müller, 1997). IPM models seek to optimize an objective of the form
$$V_{\mathrm{IPM}}(p_d, p_G; D) = \mathbb{E}_{X \sim p_d}[D(X; \omega)] - \mathbb{E}_{X' \sim p_G}[D(X'; \omega)],$$
where $\mathbb{E}_{X \sim p_d}[\cdot]$ denotes the expectation wrt the distribution $p_d(x)$. When the critic $D(x; \omega)$ is chosen from a unit ball of Lipschitz-1 functions ($\|D\|_{\mathrm{Lip}} \le 1$), the IPM reduces to the Wasserstein-1 or earth mover's distance (Rubner et al., 2000). In this case, the challenge with $d_{\mathrm{IPM}}(p_d, p_G)$ lies in constraining the Lipschitz constant of the critic function (Arjovsky et al., 2017; Gulrajani et al., 2017; Miyato et al., 2017).

Separately, Reproducing Kernel Hilbert Space (RKHS) theory has motivated development of a powerful set of methods to handle probability problems (Muandet et al., 2017). In particular, the embedding of probability measures via kernels (Sriperumbudur et al., 2010) has attracted significant interest. Let $\kappa(\cdot, \cdot)$ be a positive definite function known as the kernel function. The kernel embedding of a distribution $p(x)$ is given by $\nu_p(x) \triangleq \mathbb{E}_{X \sim p}[\kappa(x, X)]$. The Maximum Mean Discrepancy, $\mathrm{MMD}(p_d, p_G) \triangleq \|\nu_{p_d} - \nu_{p_G}\|_{\mathcal{H}}$, defines a distance metric on distributions $p_d(x)$ and $p_G(x)$, where $\|\cdot\|_{\mathcal{H}}$ is the norm induced by the kernel in the Hilbert space $\mathcal{H}$. MMD readily translates into an algorithm that does not require the adversarial game for generative modeling (Li et al., 2015; Dziugaite et al., 2015). However, RKHS-based generative models have high computational cost and struggle when dealing with complex distributions (Bińkowski et al., 2018). In practice, good performance can be achieved with careful hyperparameter tuning and by introducing auxiliary loss terms to the objective (Zhang et al., 2017; Li et al., 2017b).
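For reference, a minimal sketch of the standard (biased, V-statistic) empirical MMD² estimator with a Gaussian kernel is shown below; the kernel choice, bandwidth and toy distributions are illustrative assumptions rather than settings used in the paper.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # Pairwise Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2)).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd2_biased(X, Y, bandwidth=1.0):
    # Biased (V-statistic) estimate of MMD^2 = ||nu_p - nu_q||_H^2
    #   = E[k(X, X')] - 2 E[k(X, Y)] + E[k(Y, Y')].
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # samples standing in for p_d
Y = rng.normal(0.5, 1.0, size=(500, 2))   # samples standing in for p_G
# Clearly positive for mismatched distributions, near zero for matched ones.
print(mmd2_biased(X, Y), mmd2_biased(X, rng.normal(0.0, 1.0, size=(500, 2))))
```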
From a pragmatic perspective, GANs rarely converge to the desired equilibrium (Arora & Zhang, 2018), instead settling for a sub-optimal local solution, where samples produced by the trained generative model often lack diversity (Salimans et al., 2016). To alleviate these issues, most existing studies focus on seeking more stable architectures (Radford et al., 2016), enforcing heuristically-derived or theoretically-inspired regularized objectives (Salimans et al., 2016; Warde-Farley & Bengio, 2017; Mescheder et al., 2017b; Roth et al., 2017), and procedures that leverage carefully designed optimization paths (Karras et al., 2018).

We present new theoretical insights on GAN-based generative modeling, the cause of some of its difficulties, and principled solutions to address the associated challenges. Our key contributions include: (i) We present theory connecting three major generative modeling frameworks: divergence-, IPM- and kernel-based approaches. (ii) A novel, conceptually simple procedure is introduced, termed χ²-GAN, that is stable at training and embraces sample diversity during generation. (iii) It is demonstrated that our formulation naturally generalizes to problems requiring simultaneous matching of multiple distributions. (iv) We propose to fully exploit the learned critic function, by repurposing it as a weighting mechanism in a resampling procedure, leveraging useful information from the critic to improve sample quality.

## 2. Learning χ² GANs

### 2.1. Distribution Mixture and Generative Modeling

Consider joint random variables $(X, Y)$, where $X$ is drawn from the mixture distribution $[p(x) + q(x)]/2$, and $Y$ is a random variable identifying the mixture component from which $X$ is drawn; $Y = +1$ if $X$ is drawn from $p(x)$, and $Y = -1$ if $X$ is drawn from $q(x)$. We denote the joint density for $(X, Y)$ as $\mu(x, y; p, q)$; to avoid notational clutter, we often omit its dependency on $(p, q)$ when the context is clear. Further, let $\mu(x)$ and $\mu(y)$ be the marginals of $\mu(x, y)$. It can be readily verified that $X$ and $Y$ from $\mu(x, y)$ are statistically independent if and only if $p(x)$ and $q(x)$ match.

For subsequent generative modeling, $p(x)$ is the true data distribution $p_d(x)$, and $q(x)$ is the generator distribution $p_G(x)$. We can therefore cast the problem of learning the parameters of a generative model $G(z; \theta)$ as seeking to match our generator distribution to that of the data, by minimizing the statistical dependency between the data variable $X$ and its label $Y$.

More generally, random variables $(X, Y)$ are manifested with $X$ drawn from the mixture model $\frac{1}{M}\sum_{m=1}^{M} p_m(x)$ and with $Y \in \{1, \ldots, M\}$ identifying the mixture component from which $X$ is drawn. Extending the ideas discussed above, we can jointly match $M > 2$ distributions $\{p_m(x)\}$ by minimizing the statistical dependency between the data $X$ and the mixture-component label $Y$.

### 2.2. Covariance Operator and Statistical Dependency

Let $f(x): \mathcal{X} \to \mathbb{R}$ and $g(y): \mathcal{Y} \to \mathbb{R}$ be two square-integrable functions defined over the domains of $X$ and $Y$, respectively. The covariance wrt the joint density $\mu(x, y)$ is
$$\mathrm{cov}(f, g) = \mathbb{E}_{X,Y \sim \mu}\left[(f(X) - \mathbb{E}[f])(g(Y) - \mathbb{E}[g])\right],$$
where $\mathbb{E}[f] = \mathbb{E}_X[f(X)]$ is the marginal mean of $f(x)$, similarly defined for $\mathbb{E}[g]$. When $X$ and $Y$ are statistically independent, i.e. $\mu(x, y) = \mu(x)\mu(y)$, $\mathrm{cov}(f, g) = 0$ for every choice of $f(x)$ and $g(y)$. We show below that statistical independence between $X$ and $Y$ is implied when $\mathrm{cov}(f, g) = 0$ holds for every $f(x)$ and $g(y)$ chosen from a sufficiently rich function space.

Let $\mathcal{H}_X$ and $\mathcal{H}_Y$ be two Hilbert spaces for functions defined on $\mathcal{X}$ and $\mathcal{Y}$, equipped with inner products $\langle \cdot, \cdot \rangle_{\mathcal{H}_X}$ and $\langle \cdot, \cdot \rangle_{\mathcal{H}_Y}$, respectively. In an RKHS, the kernel function defines the inner product, i.e., $\langle \kappa(x, \cdot), \kappa(x', \cdot) \rangle_{\mathcal{H}} = \kappa(x, x')$. The cross-covariance operator wrt the triplet $\{\mu(x, y), \mathcal{H}_X, \mathcal{H}_Y\}$ is defined as the operator $C_{XY}: \mathcal{H}_X \to \mathcal{H}_Y$ satisfying $\langle g, C_{XY} f \rangle_{\mathcal{H}_Y} = \mathrm{cov}(f(X), g(Y))$ for all $f(x) \in \mathcal{H}_X$ and $g(y) \in \mathcal{H}_Y$. The existence of $C_{XY}$ is a direct result of the Riesz representation theorem (Yoshida, 1974). One can further define the covariance operator $C_{XX}: \mathcal{H}_X \to \mathcal{H}_X$ as
$$\langle f_1, C_{XX} f_2 \rangle_{\mathcal{H}_X} = \mathrm{cov}(f_1(X), f_2(X)), \qquad (1)$$
and similarly for $C_{YY}$.

Further assume $\mathcal{H}_X$ and $\mathcal{H}_Y$ are separable, and let $E = \{e_i\}_{i=1}^{\infty}$ and $F = \{f_i\}_{i=1}^{\infty}$ be their respective Complete Orthonormal Systems (CONS). The Hilbert-Schmidt norm $\|\cdot\|_{\mathrm{HS}}$ of an operator $A: \mathcal{H}_X \to \mathcal{H}_Y$ is defined as
$$\|A\|_{\mathrm{HS}}^2 \triangleq \sum_{i,j=1}^{\infty} \langle A e_i, f_j \rangle_{\mathcal{H}_Y}^2, \qquad (2)$$
which is independent of the choice of the CONS (Adams & Fournier, 2003). It can be readily verified that, for the cross-covariance operator, $\|C_{XY}\|_{\mathrm{HS}}^2 = 0$ implies $\mathrm{cov}(f, g) = 0$ for all $f(x) \in \mathcal{H}_X$ and $g(y) \in \mathcal{H}_Y$.
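A small simulation illustrates the construction of Sec. 2.1 together with the covariance criterion: draw $(X, Y)$ from the mixture $\mu(x, y)$ and estimate $\mathrm{cov}(f(X), g(Y))$ for a test pair $(f, g)$. The particular samplers and test functions below are arbitrary choices for illustration, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(p_sampler, q_sampler, n):
    # Draw (X, Y) from mu(x, y): Y = +1 -> X ~ p, Y = -1 -> X ~ q, each with prob 1/2.
    y = rng.choice([1.0, -1.0], size=n)
    x = np.where(y > 0, p_sampler(n), q_sampler(n))
    return x, y

def emp_cov(fx, gy):
    # Empirical cov(f(X), g(Y)) under the joint mu(x, y).
    return np.mean((fx - fx.mean()) * (gy - gy.mean()))

p = lambda n: rng.normal(0.0, 1.0, n)
q_far = lambda n: rng.normal(2.0, 1.0, n)   # q != p
q_eq  = lambda n: rng.normal(0.0, 1.0, n)   # q == p

f = lambda x: np.tanh(x)                    # an arbitrary test function on X
g = lambda y: y                             # spans the (one-dimensional) centered functions of Y

for q in (q_far, q_eq):
    x, y = sample_mixture(p, q, 100_000)
    print(emp_cov(f(x), g(y)))
# Covariance is clearly nonzero when p != q, and near zero when p == q (X independent of Y).
```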
To understand when this implies independence, we further denote the space of square-integrable functions wrt random variable $X$ as $L^2_X$. The following result, adapted from Theorem 4 of Gretton et al. (2005a), states that when $\mathcal{H}_X$ and $\mathcal{H}_Y$ are sufficiently rich, a vanishing $\|C_{XY}\|_{\mathrm{HS}}^2$ implies the statistical independence of $X$ and $Y$, and vice versa.

**Theorem 1.** *If $\mathcal{H}_X$ and $\mathcal{H}_Y$ are dense in $L^2_X$ and $L^2_Y$, respectively, then $\|C_{XY}\|_{\mathrm{HS}}^2 = 0$ if and only if $X$ and $Y$ are statistically independent.*

Proofs for all theoretical results are found in the Supplementary Material (SM). The cross-covariance operator norm $\|C_{XY}\|_{\mathrm{HS}}^2$ in (2) is at the core of the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2012); we are interested in such criteria to assess the independence of $X$ and $Y$ discussed in Sec. 2.1. In an RKHS, let $\tilde{K}_X^{(n)}$ be the Gram matrix of samples $\{x_i\}_{i=1}^n$, whose entries are defined as $[\tilde{K}_X^{(n)}]_{ij} = \kappa(x_i, x_j)$; the centered Gram matrix is then given by $K_X^{(n)} = H_n \tilde{K}_X^{(n)} H_n$, where $H_n \triangleq I_n - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^T$ is the centering matrix, with a similar definition for $K_Y^{(n)}$; $\mathbf{1}_n$ is an $n$-dimensional vector of all ones, and $I_n$ is the $n \times n$ identity matrix. The empirical estimator for HSIC admits an elegant expression:
$$\|\hat{C}_{XY}^{(n)}\|_{\mathrm{HS}}^2 \triangleq \frac{1}{n^2}\,\mathrm{tr}\!\left(K_X^{(n)} K_Y^{(n)}\right), \qquad (3)$$
where $\mathrm{tr}(\cdot)$ is the trace operation. Despite its simple mathematical expression, it is difficult to use HSIC as an optimization objective, because computations require quadratic time, $O(n^2)$, and (3) depends on the inner product of the Hilbert space, which is not invariant to the kernel function space used. As a result, HSIC is known to be highly sensitive to the choice of the kernel Hilbert space.

### 2.3. Normalization of the Cross-covariance Operator

To circumvent the dependence on the form of the inner product, we consider the normalized cross-covariance operator $V_{XY}: \mathcal{H}_X \to \mathcal{H}_Y$ (Baker, 1973), with slight abuse of notation, given by
$$V_{XY} = C_{YY}^{-1/2} C_{XY} C_{XX}^{-1/2}.$$
Intuitively, the operators $C_{XX}^{-1/2}$ and $C_{YY}^{-1/2}$, defined via (1), normalize the covariances in function space, thus $V_{XY}$ can be understood as a cross-correlation operator. The following Lemma formalizes this intuition.

**Lemma 2.** *For $\|f\|^2_{\mathcal{H}_X} = \|g\|^2_{\mathcal{H}_Y} = 1$, we have*
$$\langle g, V_{XY} f \rangle_{\mathcal{H}_Y} = \mathrm{corr}\!\left( (C_{XX}^{-1/2} f)(X),\; (C_{YY}^{-1/2} g)(Y) \right).$$

The next two theoretical results expand on the invariance of $\|V_{XY}\|_{\mathrm{HS}}^2$ to the choice of the function space.

**Proposition 3.** *If $E = \{e_i\}_i$ is a complete orthonormal system in $\mathcal{H}_X$, then $C_{XX}^{-1/2} E = \{C_{XX}^{-1/2} e_i\}_i$ is a complete orthonormal system in $L^2_X$.*

**Corollary 4.** *Let $E = \{e_i\}_i$ and $F = \{f_j\}_j$ be the respective CONS for $\mathcal{H}_X$ and $\mathcal{H}_Y$; we have*
$$\|V_{XY}\|_{\mathrm{HS}}^2 = \sum_{i,j} \left\{ \mathrm{corr}\!\left( (C_{XX}^{-1/2} e_i)(X),\; (C_{YY}^{-1/2} f_j)(Y) \right) \right\}^2. \qquad (4)$$

The above result shows that $\|V_{XY}\|_{\mathrm{HS}}^2$ only depends on $C_{XX}^{-1/2} E$ and $C_{YY}^{-1/2} F$, and not directly on the inner product of the Hilbert space. Further, in an RKHS it is known that $\|V_{XY}\|_{\mathrm{HS}}^2$ is invariant when the kernel functions are characteristic, e.g., Gaussian kernels. Fukumizu et al. (2007) proposed a regularized empirical estimator for $\|V_{XY}\|_{\mathrm{HS}}^2$ in (4), given by
$$\|\hat{V}_{XY}^{(n)}\|_{\mathrm{HS}}^2 = \mathrm{tr}\!\left(R_X^{(n)} R_Y^{(n)}\right), \qquad (5)$$
where $R_X^{(n)} = K_X^{(n)}\big(K_X^{(n)} + \epsilon_n I_n\big)^{-1}$ (similarly defined for $R_Y^{(n)}$) and $\epsilon_n > 0$ is a regularization parameter. Use of a metric like $\|\hat{V}_{XY}^{(n)}\|_{\mathrm{HS}}^2$ appears promising as a means of assessing the independence of $X$ and $Y$. However, in addition to the challenge of selecting kernels for the RKHSs, (5) comes with significant computational overhead; the matrix inversion in $R_X^{(n)}$ requires $O(n^3)$ operations. While this estimator is shown to be consistent, provided $\epsilon_n \to 0$ and $\epsilon_n^3 n \to \infty$, empirical results indicate that $\epsilon_n$ must be carefully tuned to avoid degenerate solutions. Further, despite its theoretical invariance, empirical estimates vary significantly wrt the choice of kernels, and the estimator does not perform well in high-dimensional settings.
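A minimal NumPy sketch of the HSIC estimator (3) and the regularized estimator (5) is given below; the Gaussian kernel, bandwidth and $\epsilon_n$ are illustrative assumptions, and the $O(n^3)$ cost mentioned above shows up in the explicit matrix inversion.

```python
import numpy as np

def gram(Z, bandwidth=1.0):
    # Gaussian Gram matrix for a sample Z of shape (n, d).
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def hsic_and_nocco(X, Y, bandwidth=1.0, eps=1e-3):
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                 # centering matrix H_n
    Kx = H @ gram(X, bandwidth) @ H                     # centered Gram matrices
    Ky = H @ gram(Y, bandwidth) @ H
    hsic = np.trace(Kx @ Ky) / n ** 2                   # estimator (3)
    Rx = Kx @ np.linalg.inv(Kx + eps * np.eye(n))       # R_X^(n), O(n^3) inversion
    Ry = Ky @ np.linalg.inv(Ky + eps * np.eye(n))
    nocco = np.trace(Rx @ Ry)                           # estimator (5)
    return hsic, nocco

rng = np.random.default_rng(0)
y = rng.choice([1.0, -1.0], size=(300, 1))
x_dep = y + 0.3 * rng.normal(size=(300, 1))             # X depends on Y
x_ind = rng.normal(size=(300, 1))                       # X independent of Y
print(hsic_and_nocco(x_dep, y))
print(hsic_and_nocco(x_ind, y))
```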
### 2.4. χ² Generative Adversarial Net

Key to circumventing the excessive computational burden of (5) is that, rather than explicitly computing $C_{XX}^{-1/2} E$ and $C_{YY}^{-1/2} F$ in (4), we can instead use pre-specified function spaces $E$ and $F$ to estimate $\|V_{XY}\|_{\mathrm{HS}}^2$. Next, we derive specific results on evaluating $\|V_{XY}\|_{\mathrm{HS}}^2$ for the mixture distribution $\mu(x, y)$ defined in Sec. 2.1.

Let $\sigma^2_\mu(f)$ denote the variance of a function $f(x)$ wrt random variable $X$ with marginal $\mu(x)$. Let $\varphi(x) = \mathbb{E}[Y \,|\, X = x]$ be the conditional expectation of $Y$ given $X$, let $g_\chi(x) = \varphi(x)/\sigma_\mu(\varphi)$ be its variance-normalized version, and let $\psi(x) = \Pr(Y = +1 \,|\, X = x)$ be the conditional probability of $Y = +1$ given $X$, i.e., the critic used in the original GAN, $D(x; \omega)$. The χ² mutual information of $\mu(x, y)$ is defined as
$$\mathrm{MI}_\chi(\mu) = \iint \left( \frac{\mu(x, y)}{\mu(x)\mu(y)} - 1 \right)^2 \mu(x)\mu(y)\, dx\, dy,$$
where $\mu(x)$ and $\mu(y)$ denote the marginals. The χ² distance between $p(x)$ and $q(x)$ is thus defined as
$$\mathrm{Dis}_\chi(p, q) = \sqrt{ \int \frac{(p(x) - q(x))^2}{\mu(x)}\, dx }, \qquad \mu(x) = \tfrac{1}{2}[p(x) + q(x)].$$
The next proposition, connecting divergence-, IPM- and Hilbert-space-based discrepancies, constitutes our main result.

**Proposition 5.** *The following quantities are identical: (i) $\|V_{YX}\|_{\mathrm{HS}}^2$, (ii) $\mathrm{MI}_\chi(\mu)$, (iii) $\{\mathrm{Dis}_\chi(p, q)\}^4 / (16\, \|\varphi(X)\|^2_\mu)$, (iv) $\{\mathrm{corr}(\varphi(X), Y)\}^2$, (v) $\{\mathrm{corr}(\psi(X), Y)\}^2$, (vi) $\left(\mathbb{E}_{X_1 \sim p}[g_\chi(X_1)] - \mathbb{E}_{X_2 \sim q}[g_\chi(X_2)]\right)^2 / 4$.*

The equivalence between (i)-(iii) unveils the connections between the RKHS independence metric $\|V_{YX}\|_{\mathrm{HS}}^2$, the information-theoretic divergence metric $\mathrm{MI}_\chi(\mu)$ and the variance-constrained IPM metric $\mathrm{Dis}_\chi(p, q)$ (Mroueh & Sercu, 2017). On the other hand, (iv)-(vi) provide us with practical empirical estimators for these metrics. The key insight from Proposition 5 is that we only need to compute the critic $\psi(x)$ or $\varphi(x)$ to estimate $\|V_{XY}\|_{\mathrm{HS}}^2$, or equivalently, the χ² mutual information of the mixture distribution in Sec. 2.1. Consequently, to formulate an optimization procedure for generative modeling, we can minimize $\|V_{XY}\|_{\mathrm{HS}}^2$ wrt the generator. We call the framework χ²-GAN due to its inherent connection to the χ² metric.

We now detail the construction of χ²-GAN based on estimator (iv) from Proposition 5. Let $p_G(x)$ be the data-generating distribution implicitly defined by a (deep neural) generator $G(z; \theta)$ parametrized by $\theta$, with $Z \sim q(z)$, and let $\mathcal{D} = \{D(x; \omega);\ \omega \in \Omega\}$ be a parameterized family from which we choose the critic $D(x; \omega)$. The (least-squares) loss functions for the critic and generator are given by
$$L_D(\omega \,|\, \theta) = \mathbb{E}_{X,Y \sim \mu}\!\left[ (D(X; \omega) - Y)^2 \right], \qquad L_G(\theta \,|\, \omega) = \left( \frac{\mathbb{E}_{X \sim p_d}[D(X; \omega)] - \mathbb{E}_{X \sim p_G}[D(X; \omega)]}{\sigma_\mu(D(X; \omega))} \right)^{\!2},$$
where $\mu(x, y)$ is the mixture joint density from Sec. 2.1 with $p = p_d$ and $q = p_G$, i.e., $\mu(x, +1) = \tfrac{1}{2} p_d(x)$ and $\mu(x, -1) = \tfrac{1}{2} p_G(x)$. Note that $\mu(x, y)$ and $\sigma_\mu(\cdot)$ are implicitly dependent on $\theta$ via $p_G(x)$. We further denote $\omega^*(\theta) \triangleq \arg\min_\omega L_D(\omega \,|\, \theta)$, the optimal critic parameters conditioned on the generator. When $\mathcal{D}$ is dense in $L^2_X$, we have $D(x; \omega^*(\theta)) = \varphi(x)$ and $L_G(\theta \,|\, \omega^*(\theta)) = 4\, \|V_{XY}\|_{\mathrm{HS}}^2$, which follows from Proposition 5. To match $p_G(x)$ to $p_d(x)$, we solve $\min_\theta L_G(\theta \,|\, \omega^*(\theta))$.
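To make the two objectives concrete, the following is a minimal PyTorch sketch of $L_D$ and $L_G$ evaluated on a minibatch. This is not the paper's released implementation: the toy networks, data, optimizer settings, and the small constant added to the $\sigma_\mu$ estimate are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy 1-D generator and critic standing in for G(z; theta) and D(x; omega).
G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
D = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))

def critic_loss(x_real, x_fake):
    # L_D: least-squares regression of D(x) onto the labels Y = +1 (real) / -1 (fake).
    return ((D(x_real) - 1.0) ** 2).mean() + ((D(x_fake) + 1.0) ** 2).mean()

def generator_loss(x_real, x_fake):
    # L_G: squared, variance-normalized difference of critic means over the two samples.
    d_real, d_fake = D(x_real), D(x_fake)
    sigma = torch.cat([d_real, d_fake], dim=0).std() + 1e-8   # sigma_mu(D) on the mixture
    c = (d_real.mean() - d_fake.mean()) / sigma
    return c ** 2

opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)

for _ in range(5):                                   # a few illustrative iterations
    x_real = torch.randn(128, 1) * 0.5 + 2.0         # stand-in data batch from p_d
    z = torch.randn(128, 8)
    opt_D.zero_grad(); critic_loss(x_real, G(z).detach()).backward(); opt_D.step()
    opt_G.zero_grad(); generator_loss(x_real, G(z)).backward(); opt_G.step()  # only G stepped
```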
We propose to decouple the above optimization scheme into the following GAN-like iterations:
$$\omega_t \leftarrow \arg\min_\omega L_D(\omega \,|\, \theta_{t-1}), \qquad \theta_t \leftarrow \arg\min_{\theta \in \mathcal{B}(\theta_{t-1})} L_G(\theta \,|\, \omega_t, \theta_{t-1}),$$
where $\mathcal{B}(\theta_{t-1})$ denotes the trust region for the generator update. In $L_G(\theta \,|\, \omega_t, \theta_{t-1})$, we have replaced $\sigma_\mu(D(x; \omega))$ with its stale estimate $\sigma_{\mu_{t-1}}(D(x; \omega))$. We regularize the update of the generator to make sure $L_G(\theta \,|\, \omega_t)$ remains a good approximation to $\|V_{XY}\|_{\mathrm{HS}}^2$. This can be implemented with proximal gradient descent, or simply via gradient clipping, as summarized in Algorithm 1.

**Algorithm 1** χ²-GAN.
Input: data $\{x_i\}$, batch size $b$, decay $\rho$, learning rate $\delta$.
For $t = 1, 2, 3, \ldots$ do:
1. Sample a minibatch $\{x_i \sim p_d(x),\ z_i \sim p(z)\}_{i=1}^{b}$.
2. Update the critic $D(x; \omega)$ to minimize $\sum_i (D(x_i; \omega) - 1)^2 + (D(G(z_i; \theta_{t-1}); \omega) + 1)^2$.
3. Update the variance estimate $\sigma^2_\mu$ for $D(x; \omega_t)$.
4. Update the correlation estimate $c_t(\theta_{t-1}) = (1 - \rho)\, c_{t-1} + \frac{\rho}{2b} \sum_i \frac{D(x_i; \omega_t) - D(G(z_i; \theta_{t-1}); \omega_t)}{\sigma_\mu(D(x; \omega_t))}$.
5. Update the generator $G(z; \theta)$: $\theta_t = \theta_{t-1} - \delta\, \mathrm{GradClip}\!\left[\nabla_\theta \big(c_t(\theta_{t-1})\big)^2\right]$.

### 2.5. Joint Matching of Multiple Distributions

We now discuss the generalization to multi-component mixtures. Since the $M$ components in $\frac{1}{M}\sum_{m=1}^{M} p_m(x)$ are mutually exclusive, the space $L^2_Y$ has dimension $M - 1$. We use the following $M - 1$ empirical basis functions to span $L^2_Y$:
$$\gamma_m(y_l) = \begin{cases} 1, & m = l \\ -(M-1)^{-1}, & m \neq l \end{cases}, \qquad m \in \{1, \ldots, M-1\},\ l \in \{1, \ldots, M\}.$$
We collect $\{\gamma_m(y)\}_{m=1}^{M-1}$ into a vector function $\Gamma(y) = [\gamma_1(y), \ldots, \gamma_{M-1}(y)]^T$, with corresponding $(M-1) \times (M-1)$ covariance matrix $\Sigma_Y$ with elements $[\Sigma_Y]_{ij} = 1/(M-1)$ if $i = j$ and $[\Sigma_Y]_{ij} = -1/(M-1)^2$ otherwise. Note that $\mathbb{E}_Y[\Gamma(Y)] = 0$, and this construction exactly recovers our binary labeling $Y \in \{\pm 1\}$ when $M = 2$.

For the data side, we use the critics $\psi_m(x) = \Pr(Y = y_m \,|\, X = x)$ as the empirical basis, and similarly use a compact $(M-1)$-dimensional vector representation $\tilde{\Psi}(x) = [\psi_1(x), \ldots, \psi_{M-1}(x)]^T$. We use $\Psi(x) = \tilde{\Psi}(x) - \mathbb{E}_X[\tilde{\Psi}(X)]$ for mean centering and denote its covariance matrix as $\Sigma_X$. The objective for multi-distribution matching is then
$$V_{\chi^2} = \left\| \mathbb{E}_{X,Y \sim \mu}\!\left[ \Sigma_X^{-1/2}\, \Psi(X)\, \Gamma(Y)^T\, \Sigma_Y^{-1/2} \right] \right\|_{\mathrm{Fro}}^2,$$
where $\|\cdot\|_{\mathrm{Fro}}$ denotes the Frobenius norm. In practice, we use a cross-entropy loss to estimate $\tilde{\Psi}(x)$, and leverage a moving-average estimator to track the expectation and covariance of $\tilde{\Psi}(x)$; otherwise the procedure is similar to Algorithm 1. The complete algorithm and additional remarks are found in the SM.
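A small NumPy sketch may help make the basis $\Gamma(y)$ and the Frobenius-norm objective concrete. The plug-in covariance estimates, the jitter terms and the softmax toy critic below are assumptions for illustration only; the full multi-distribution procedure is in the SM.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4                                                   # number of distributions to match

def Gamma(y):
    # Gamma(y) in R^{M-1}: entry m equals 1 if m == y, and -1/(M-1) otherwise.
    g = np.full(M - 1, -1.0 / (M - 1))
    if y < M - 1:
        g[y] = 1.0
    return g

# Under a uniform label Y, cov(Gamma(Y)) matches the closed form given above.
basis = np.stack([Gamma(y) for y in range(M)])          # rows are Gamma(y_l), l = 1..M
Sigma_Y = basis.T @ basis / M                           # E[Gamma Gamma^T] (Gamma is zero mean)
assert np.allclose(np.diag(Sigma_Y), 1.0 / (M - 1))
assert np.allclose(Sigma_Y[0, 1], -1.0 / (M - 1) ** 2)

def inv_sqrtm(S):
    # Inverse matrix square root of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T

def v_chi2(Psi, y):
    # Plug-in estimate of || Sigma_X^{-1/2} E[Psi(X) Gamma(Y)^T] Sigma_Y^{-1/2} ||_Fro^2.
    Psi_c = Psi - Psi.mean(0, keepdims=True)            # mean-centered critic outputs
    Gy = np.stack([Gamma(t) for t in y])
    Sigma_X = np.cov(Psi_c, rowvar=False) + 1e-6 * np.eye(M - 1)
    C = Psi_c.T @ Gy / len(y)
    W = inv_sqrtm(Sigma_X) @ C @ inv_sqrtm(Sigma_Y)
    return float(np.sum(W ** 2))

# Toy check: critic outputs carrying label information yield a larger objective
# than label-independent outputs, which drive the objective towards zero.
y = rng.integers(0, M, size=2000)
logits = rng.normal(size=(2000, M)) + 2.0 * np.eye(M)[y]
Psi = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
print(v_chi2(Psi[:, :M - 1], y), v_chi2(rng.normal(size=(2000, M - 1)), y))
```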
### 2.6. Importance Resampling

Current practice in modeling with GANs discards the critic after training, with the learned generator used for sampling. However, the generator distribution rarely reaches the desired target distribution on real-world complex datasets, while the trained critic contains useful local information that does not get incorporated into the generator during training. Consider two cases: i) the generator does not have enough capacity to characterize the target distribution, and approaches a solution that covers the support of the target without properly capturing its topology; ii) the target has disjoint support regions and the generator covers each of them, but with inconsistent amounts of probability mass. In either case, as described below, the critic can be repurposed to provide additional information to assist sample generation (after training). To fully harness the information from the critic to improve sample quality, we propose to resample the generator.

Recall that in importance sampling, one draws samples from a proposal distribution $q(x)$ and then reweights them by their importance weights $w(x) = p(x)/q(x)$, to compute expectations wrt the target distribution $p(x)$, i.e.,
$$\mathbb{E}_{X \sim p}[f(X)] = \mathbb{E}_{X \sim q}[w(X) f(X)] \approx \frac{1}{n} \sum_{i=1}^{n} \frac{p(x_i)}{q(x_i)}\, f(x_i),$$
where $\{x_i\}_{i=1}^n$ are iid samples from $q(x)$. We propose treating the data-generating distribution $p_G(x)$ as the proposal distribution, and collecting the importance weights $w(x) = p_d(x)/p_G(x)$ from the critic. For divergence-based generative models (f-GAN, χ²-GAN, etc.), importance weights can often be directly computed from the critic via some simple transformation $\zeta(\cdot)$ (see SM for details). For other generative models, an auxiliary log-density-ratio critic can be trained with a cross-entropy loss to track the importance weights. We summarize the importance resampling procedure in Algorithm 2.

**Algorithm 2** Importance resampling.
Input: generator $G(z; \theta)$, critic $D(x; \omega)$, sample size $n$.
1. Sample candidates $\{x_i = G(z_i; \theta)\}_{i=1}^{n}$, with $z_i \sim p(z)$.
2. Compute normalized importance weights $\bar{w}_i = w_i \big(\sum_{j=1}^{n} w_j\big)^{-1}$, where $w_i = \zeta(D(x_i; \omega))$.
3. Sample $j \sim \mathrm{Cat}(\bar{w}_1, \ldots, \bar{w}_n)$.
Return: $x_j = G(z_j; \theta)$.
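A minimal NumPy sketch of Algorithm 2 follows. The transformation $\zeta$ used here assumes an (ideal) least-squares critic $D(x) \approx \varphi(x)$, for which the density ratio is $p_d/p_G = (1+\varphi)/(1-\varphi)$; the toy generator/critic pair, latent dimension and clipping constants are purely illustrative, and the paper's exact choices of $\zeta$ are given in the SM.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample(generator, critic, n, zeta):
    # Algorithm 2: draw n candidates from the generator, weight them via the critic,
    # and return one candidate drawn proportionally to its normalized importance weight.
    z = rng.normal(size=(n, 1))                   # latent draws z_i ~ p(z) (dimension assumed)
    x = generator(z)
    w = np.clip(zeta(critic(x)).ravel(), 1e-8, 1e8)
    w = w / w.sum()
    j = rng.choice(n, p=w)
    return x[j]

# For a least-squares critic D(x) ~= phi(x) = (p_d - p_G)/(p_d + p_G),
# one consistent density-ratio transform is zeta(d) = (1 + d) / (1 - d).
zeta = lambda d: (1.0 + np.clip(d, -0.99, 0.99)) / (1.0 - np.clip(d, -0.99, 0.99))

generator = lambda z: 1.5 * z                     # p_G = N(0, 1.5^2); target p_d = N(0, 1)

def critic(x):
    # Exact phi(x) for p_d = N(0, 1) and p_G = N(0, 1.5^2) (stand-in for a trained critic).
    pd = np.exp(-0.5 * x ** 2)
    pg = np.exp(-0.5 * (x / 1.5) ** 2) / 1.5
    return (pd - pg) / (pd + pg)

samples = np.array([resample(generator, critic, 64, zeta) for _ in range(2000)])
print(samples.std())   # noticeably closer to the target std of 1 than the raw generator's 1.5
```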
## 3. Related Work

The proposed χ²-GAN connects three popular likelihood-free generative modeling frameworks (see Figure 1). It is derived from the theory of RKHS independence analysis, and it can be shown that the popular MMD objective (Gretton et al., 2012) is an unnormalized version of our χ² objective (see SM). χ²-GAN optimizes a divergence criterion with an IPM loss, using a critic trained with a stable least-squares loss, similar to that of LS-GAN (Mao et al., 2017). Regular divergence-based GANs directly optimize the divergence between $p_d(x)$ and $p_G(x)$, while χ²-GAN instead optimizes the divergence between $\mu(x, y)$ and $\mu(x)\mu(y)$, which characterizes the independence between the sample and its associated label, i.e., the mutual information. This allows easy generalization beyond matching two distributions.

*Figure 1. Relations between likelihood-free generative models.*

Fisher GAN (Mroueh & Sercu, 2017) is closest to our χ²-GAN; it builds on the IPM framework and, like ours, also normalizes the critic with its second moment. The key differences are that Fisher GAN relies on a more sophisticated augmented Lagrangian to optimize the same objective for both the critic and the generator, while χ²-GAN decouples the critic and generator objectives, requiring simpler (unconstrained) stochastic-gradient-descent-type updates.

For RKHS-based generative modeling the choice of kernel is crucial. Theoretical properties and empirical performance have been analyzed for a number of popular kernels, such as the inverse multi-quadratic (Gorham & Mackey, 2017), Plummer (Unterthiner et al., 2018) and rational quadratic (Bińkowski et al., 2018) kernels, and the energy distance (Liu, 2017). Although derived from a kernel formulation, our training procedure is in practice kernel-free.

Modern generative modeling has deep roots in statistical testing. Prior studies have primarily focused on two-sample tests (Gretton et al., 2012). Our study builds on the work of independence testing (Gretton et al., 2005b), which generalizes two-sample tests and can be extended to settings with multiple generators and critics (Durugkar et al., 2017).

Simultaneous matching of multiple distributions is a key technique needed in many machine learning applications (Zhao et al., 2017). In the GAN context, generalizations of the JSD metric have been explored to address this challenge (Gan et al., 2017; Li et al., 2017a; Pu et al., 2018b). However, for RKHS- and IPM-based generative models there is currently no such generalization, and one has to build $M(M-1)/2$ pairwise critics, which can be prohibitively expensive when $M$ grows large. Our χ²-GAN represents the first attempt to bridge this gap. Our theory implies that using a quadratic number of critics is unnecessary; instead, χ²-GAN computes an $(M-1)$-dimensional critic.

Importance sampling is a classic technique used in Monte Carlo methods (Liu, 2008). One of its key applications is to evaluate the quality of statistical models (Neal, 2001; Wu et al., 2017). Recently, this idea has been used in likelihood-based generative models to sharpen the variational bound (Burda et al., 2016), and has been extended to improve the training of likelihood-assisted GAN variants (Hu et al., 2018). To the best of the authors' knowledge, resampling the generator as proposed here has not been explored before and, not being exclusive to our formulation, can easily be used with other methods.

## 4. Experiments

We consider a wide range of synthetic and real-world tasks to experimentally validate χ²-GAN, and benchmark it against other state-of-the-art solutions. All experiments are implemented with TensorFlow and run on a single NVIDIA TITAN X GPU. Details of the experimental setup are in the SM, and code for our experiments is available at https://www.github.com/chenyang-tao/chi2gan.

### 4.1. Toy Distributions

We compare χ²-GAN with one representative model from each category, namely the original GAN, WGAN-GP and MMD-GAN, on three toy distributions. The same model architecture is used for all models, with the exception of MMD-GAN, which does not need an explicit critic function. The generation results are summarized in Figure 2. All models except MMD perform well on the baseline Swiss roll experiment. The mixture-of-Gaussians experiments test algorithm robustness to mode collapse. The original GAN demonstrates its vulnerability when dealing with distributions with disjoint modes, and MMD learns an overly smoothed distribution, even with a carefully tuned kernel hyperparameter. Both WGAN-GP and χ²-GAN successfully learn good approximations to the target distribution, with the latter showing faster convergence and more stable training.

*Figure 2. Toy model comparison for GAN, MMD-GAN, WGAN-GP and χ²-GAN. Distributions are visualized with KDE plots.*

### 4.2. Image Datasets

We trained χ²-GAN on a number of popular image datasets to demonstrate its ability to learn complex distributions for real-world applications. For supervised generation tasks, we condition the generator on the label of an image. To quantitatively evaluate model performance, we consider the following metrics in our experiments: (1) the Inception Score (IS) (Salimans et al., 2016) for datasets associated with one-hot labels; (2) the AIS score (Wu et al., 2017) to estimate the log-likelihood.
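For reference, the Inception Score is $\exp\!\big(\mathbb{E}_x[\mathrm{KL}(p(y|x)\,\|\,p(y))]\big)$, computed from classifier posteriors on generated samples. A minimal sketch, assuming the class-probability matrix has already been obtained from a pretrained Inception classifier, is:

```python
import numpy as np

def inception_score(probs, n_splits=10):
    # probs: (N, C) array of classifier posteriors p(y|x) for N generated images.
    # IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ), conventionally averaged over several splits.
    scores = []
    for chunk in np.array_split(probs, n_splits):
        p_y = chunk.mean(axis=0, keepdims=True)              # marginal p(y) over the split
        kl = (chunk * (np.log(chunk + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))

# Toy usage with random posteriors (a real evaluation uses an Inception network's outputs).
probs = np.random.default_rng(0).dirichlet(np.ones(10), size=5000)
print(inception_score(probs))
```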
We only report quantitative results for the raw generator distribution in the main text; results for the importance-resampled generator are found separately in the SM. Two network architectures were considered in these experiments: DCGAN and ResNet. In all experiments we used Xavier initialization and the Adam optimizer. All images shown are random samples, not cherry-picked. We note that better quantitative results have been reported in the literature using specific techniques orthogonal to our main contributions (Karras et al., 2018; Warde-Farley & Bengio, 2017; Miyato et al., 2017; Salimans et al., 2018); see the SM for a discussion.

**MNIST** We used the binarized MNIST in this experiment and compare with prior results in Table 1. Our χ²-GAN achieves an AIS-estimated log-likelihood of −78.85 nats and an IS of 9.01. These results lead all likelihood-free generative models we considered, and are comparable to, or even better than, those of the best-performing likelihood-based models.

*Table 1. Quantitative results on MNIST (log p(x) is estimated using AIS; some values are reported in Hu et al. (2018); several baselines are likelihood-based models).*

| Model | log p(x) | IS |
|---|---|---|
| NF (Rezende & Mohamed, 2015) | -85.1 | - |
| PixelRNN (Oord et al., 2016) | -79.2 | - |
| AVB (Mescheder et al., 2017a) | -79.5 | - |
| ASVAE (Pu et al., 2017) | -81.14 | - |
| sVAE-r (Pu et al., 2018a) | -79.26 | 9.12 |
| GAN (Goodfellow et al., 2014) | -114.25 | 8.34 |
| WGAN-GP (Gulrajani et al., 2017) | -79.92 | 8.45 |
| DCGAN (Radford et al., 2016) | -79.47 | 8.93 |
| χ²-GAN (ours) | -78.85 | 9.01 |

**Cifar10** For this dataset, we experimented with both unsupervised and supervised generation tasks. Quantitative results are summarized in Tables 2 and 3. For both tasks, χ²-GAN consistently achieved state-of-the-art results among the network architectures considered. Most notably, our χ²-GAN significantly outperformed DCGAN, MMD-GAN and WGAN-GP in the unsupervised generation task with the DCGAN architecture. We also provide quantitative results with the Fréchet Inception Distance (FID) (Heusel et al., 2017) in the SM. See Figure 3 for a qualitative assessment.

*Table 2. Unsupervised Inception Score on CIFAR-10.*

| Model | IS |
|---|---|
| ALI (Dumoulin et al., 2017) | 5.34 ± .05 |
| DCGAN (Radford et al., 2016) | 6.16 ± .07 |
| MMD-GAN (Li et al., 2017b) | 6.17 ± .07 |
| WGAN-GP (Gulrajani et al., 2017) | 6.56 ± .05 |
| ASVAE (Pu et al., 2017) | 6.89 ± .05 |
| sVAE-r (Pu et al., 2018a) | 6.96 ± .066 |
| χ²-GAN (ours) | 7.47 ± .105 |
| WGAN-GP ResNet | 7.86 ± .07 |
| Fisher-GAN ResNet (Mroueh & Sercu, 2017) | 7.90 ± .05 |
| χ²-GAN ResNet (ours) | 7.88 ± .10 |

*Table 3. Supervised Inception Score on CIFAR-10.*

| Model | IS |
|---|---|
| Stein GAN (Wang & Liu, 2016) | 6.16 ± .07 |
| DCGAN (Radford et al., 2016) | 6.58 ± .05 |
| AC-GAN (Odena et al., 2017) | 8.25 ± .07 |
| SGAN (Huang et al., 2017) | 8.59 ± .12 |
| Fisher-GAN ResNet (Mroueh & Sercu, 2017) | 8.16 ± .12 |
| WGAN-GP ResNet (Gulrajani et al., 2017) | 8.42 ± .10 |
| χ²-GAN ResNet (ours) | 8.44 ± .10 |

*Figure 3. Generated images on Cifar-10. Left: unsupervised generation; right: supervised generation.*

**CelebA** We provide a comparison of DCGAN, Fisher GAN and our χ²-GAN on the face-generation task in Figure 4. We trained DCGAN and χ²-GAN to generate the face samples, and collected Fisher GAN's samples from the original paper. All models used the DCGAN architecture. We observe that χ²-GAN produced (subjectively) more compelling samples, capturing facial details, illumination and more realistic textures compared with its counterparts. Additional samples are shown in the SM, and we find no evidence of mode collapse. We also provide additional experimental evidence in the SM to verify that χ²-GAN learns generalizable features rather than memorizing training examples.

We also use the face-generation task to demonstrate the efficacy of importance resampling. In Figure 4 we compare the accepted and rejected samples (we used a less well-trained model and picked samples based on the importance weights to highlight the difference). The less compelling samples produced by the generator are immediately identified based on the importance score. Quantitative results for importance resampling can be found in the SM.

*Figure 4. Face generation quality comparison. From left to right: DCGAN, Fisher GAN, χ²-GAN, high importance-weight samples and low importance-weight samples.*

**ImageNet** We also considered ImageNet to evaluate the scalability of the models on large datasets. All images are resized to 64 × 64, and the quantitative results are reported in Table 4.

*Table 4. Unsupervised Inception Score on ImageNet.*

| Model | IS |
|---|---|
| DCGAN (Radford et al., 2016) | 5.965 |
| PixelCNN++ (Salimans et al., 2017) | 7.65 |
| ASVAE (Pu et al., 2017) | 11.14 |
| χ²-GAN (ours) | 11.34 |
With the simple DCGAN architecture, our χ²-GAN significantly outperformed the more sophisticated PixelCNN++, even surpassing the performance of the likelihood-based ASVAE. See the SM for generated samples.

**Stability, robustness, convergence and sample diversity** In all our experiments on the image datasets, χ²-GAN demonstrated stable training dynamics. It showed a convergence rate similar to WGAN-GP in terms of iterations (wrt the IS score and visual inspection; see the SM), but with a much cheaper per-iteration cost. χ²-GAN also demonstrated robustness as we varied model architectures, network normalization schemes and hyperparameters (results not shown). We also found no evidence of mode collapse in our experiments.

### 4.3. Matching Multiple Distributions

To demonstrate the flexibility of the χ²-GAN framework in matching multiple distributions, we consider the following translation task: given paired examples $\{(x_i, z_i)\}_{i=1}^{n}$, sample from all possible translations $z$ for a new observation $x$, and vice versa. More explicitly, consider the distribution triplet $p_0(x, z) = p_d(x, z)$, $p_1(x, z) = p_d(x)\, q_1(z \,|\, x; \theta_1)$, and $p_2(x, z) = p_d(z)\, q_2(x \,|\, z; \theta_2)$, where $q_1(z \,|\, x; \theta_1)$ and $q_2(x \,|\, z; \theta_2)$ are the translation models. When translations are faithful to the data distribution, we have $p_0(x, z) = p_1(x, z) = p_2(x, z)$. Here $(x, z)$ are paired data, and we consider the problem of image-to-image translation.

**Rotated MNIST** In this experiment we pair each MNIST digit with a random sample of the same type, rotated by 90°. Our translation results are presented in Figure 5. We observe that χ²-GAN translations achieve both fidelity and diversity for this task.

*Figure 5. MNIST image translation. First column is the input image from the test set; subsequent columns are sampled translations.*

**Edges-to-shoes** We evaluate performance on the more realistic edges-to-shoes dataset, where the model learns to translate between shoes and sketches. As shown in Figure 6, χ²-GAN learned to produce faithful translations.

*Figure 6. Edges2shoes translation.*

## 5. Conclusions

We have developed a framework that unifies prior theoretical frameworks for likelihood-free generative modeling, and based on it we proposed a novel algorithm named χ²-GAN. Our approach is conceptually simple, and can be readily generalized to match multiple distributions. Empirical evidence verified that this new method offers competitive performance on a wide range of generation tasks. For future work, we intend to investigate its connections to likelihood-based generative models, and to seek novel applications by integrating it with other machine learning algorithms.

## Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful comments. This research was supported in part by DARPA, DOE, NIH, ONR and NSF. J. Feng is partially supported by the key project of the Shanghai Science & Technology Innovation Plan (No. 16JC1420402). The authors would also like to thank Prof. C. Leng, Dr. Y. Zhang, Dr. Y. Pu and K. Bai for fruitful discussions.

## References
Adams, R. A. and Fournier, J. J. Sobolev spaces, volume 140. Academic Press, 2003.
Arjovsky, M. and Bottou, L. Towards principled methods for training generative adversarial networks. In NIPS Workshop, 2017.
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In ICML, 2017.
Arora, S. and Zhang, Y. Do GANs actually learn the distribution? An empirical study. In ICLR, 2018.
Baker, C. R. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.
Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In ICLR, 2018.
Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. In ICLR, 2016.
Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad., 8:85–108, 1963.
Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., and Courville, A. Adversarially learned inference. In ICLR, 2017.
Durugkar, I., Gemp, I., and Mahadevan, S. Generative multi-adversarial networks. In ICLR, 2017.
Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.
Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. Kernel measures of conditional dependence. In NIPS, 2007.
Gan, Z., Chen, L., Wang, W., Pu, Y., Zhang, Y., Liu, H., Li, C., and Carin, L. Triangle generative adversarial networks. In NIPS, 2017.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS, 2014.
Gorham, J. and Mackey, L. Measuring sample quality with kernels. In ICML, 2017.
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, 2005a.
Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. Kernel methods for measuring independence. The Journal of Machine Learning Research, 6:2075–2129, 2005b.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., and Xing, E. P. Toward controlled generation of text. In ICML, 2017.
Hu, Z., Yang, Z., Salakhutdinov, R., and Xing, E. P. On unifying deep generative models. In ICLR, 2018.
Huang, X., Li, Y., Poursaeed, O., Hopcroft, J., and Belongie, S. Stacked generative adversarial networks. In CVPR, 2017.
Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
Li, C., Liu, H., Chen, C., Pu, Y., Chen, L., Henao, R., and Carin, L. ALICE: Towards understanding adversarial learning for joint distribution matching. In NIPS, 2017a.
Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. MMD GAN: Towards deeper understanding of moment matching network. In NIPS, 2017b.
Li, Y., Swersky, K., and Zemel, R. Generative moment matching networks. In ICML, 2015.
Liu, J. S. Monte Carlo strategies in scientific computing. Springer Science & Business Media, 2008.
Liu, L. On the two-sample statistic approach to generative adversarial networks. Master's thesis, Princeton University, 2017.
Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. Least squares generative adversarial networks. In ICCV, 2017.
Mescheder, L., Nowozin, S., and Geiger, A. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML, 2017a.
Mescheder, L., Nowozin, S., and Geiger, A. The numerics of GANs. In NIPS, 2017b.
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In ICML Implicit Models Workshop, 2017.
Mroueh, Y. and Sercu, T. Fisher GAN. In NIPS, 2017.
Mroueh, Y., Sercu, T., and Goel, V. McGAN: Mean and covariance feature matching GAN. In ICML, 2017.
Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.
Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
Neal, R. M. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
Nock, R., Cranko, Z., Menon, A. K., Qu, L., and Williamson, R. C. f-GANs in an information geometric nutshell. In NIPS, 2017.
Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
Pu, Y., Wang, W., Henao, R., Chen, L., Gan, Z., Li, C., and Carin, L. Adversarial symmetric variational autoencoder. In NIPS, 2017.
Pu, Y., Chen, L., Dai, S., Wang, W., Li, C., and Carin, L. Symmetric variational autoencoder and connections to adversarial learning. In AISTATS, 2018a.
Pu, Y., Dai, S., Gan, Z., Wang, W., Wang, G., Zhang, Y., Henao, R., and Carin, L. JointGAN: Multi-domain joint distribution learning with generative adversarial nets. In ICML, 2018b.
Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In ICML, 2015.
Roth, K., Lucchi, A., Nowozin, S., and Hofmann, T. Stabilizing training of generative adversarial networks through regularization. In NIPS, 2017.
Rubner, Y., Tomasi, C., and Guibas, L. J. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In NIPS, 2016.
Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.
Salimans, T., Zhang, H., Radford, A., and Metaxas, D. Improving GANs using optimal transport. In ICLR, 2018.
Sønderby, C. K., Caballero, J., Theis, L., Shi, W., and Huszár, F. Amortised MAP inference for image super-resolution. In ICLR, 2017.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.
Uehara, M., Sato, I., Suzuki, M., Nakayama, K., and Matsuo, Y. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
Unterthiner, T., Nessler, B., Klambauer, G., Heusel, M., Ramsauer, H., and Hochreiter, S. Coulomb GANs: Provably optimal Nash equilibria via potential fields. In ICLR, 2018.
Wang, D. and Liu, Q. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.
Warde-Farley, D. and Bengio, Y. Improving generative adversarial networks with denoising feature matching. In ICLR, 2017.
Wu, Y., Burda, Y., Salakhutdinov, R., and Grosse, R. On the quantitative analysis of decoder-based generative models. In ICLR, 2017.
Yoshida, K. Functional analysis. Grundlehren der mathematischen Wissenschaften. Springer-Verlag, 1974.
Yu, L., Zhang, W., Wang, J., and Yu, Y. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
Zhang, Y., Gan, Z., Fan, K., Chen, Z., Henao, R., Shen, D., and Carin, L. Adversarial feature matching for text generation. In ICML, 2017.
Zhao, H., Zhang, S., Wu, G., Costeira, J. P., Moura, J. M., and Gordon, G. J. Multiple source domain adaptation with adversarial training of neural networks. arXiv preprint arXiv:1705.09684, 2017.
Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.