On gradient regularizers for MMD GANs

Michael Arbel, Gatsby Computational Neuroscience Unit, University College London. michael.n.arbel@gmail.com
Danica J. Sutherland, Gatsby Computational Neuroscience Unit, University College London. djs@djsutherland.ml
Mikołaj Bińkowski, Department of Mathematics, Imperial College London. mikbinkowski@gmail.com
Arthur Gretton, Gatsby Computational Neuroscience Unit, University College London. arthur.gretton@gmail.com

These authors contributed equally.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Abstract

We propose a principled method for gradient-based regularization of the critic of GAN-like models trained by adversarially optimizing the kernel of a Maximum Mean Discrepancy (MMD). We show that controlling the gradient of the critic is vital to having a sensible loss function, and devise a method to enforce exact, analytical gradient constraints at no additional cost compared to existing approximate techniques based on additive regularizers. The new loss function is provably continuous, and experiments show that it stabilizes and accelerates training, giving image generation models that outperform state-of-the-art methods on 160×160 CelebA and 64×64 unconditional ImageNet.

1 Introduction

There has been an explosion of interest in implicit generative models (IGMs) over the last few years, especially after the introduction of generative adversarial networks (GANs) [16]. These models allow approximate samples from a complex high-dimensional target distribution P, using a model distribution Q_θ, where estimation of likelihoods, exact inference, and so on are not tractable. GAN-type IGMs have yielded very impressive empirical results, particularly for image generation, far beyond the quality of samples seen from most earlier generative models [e.g. 18, 22, 23, 24, 38]. These excellent results, however, have depended on adding a variety of regularization methods and other tricks to stabilize the notoriously difficult optimization problem of GANs [38, 42]. Some of this difficulty is perhaps because when a GAN is viewed as minimizing a discrepancy D_GAN(P, Q_θ), its gradient ∇_θ D_GAN(P, Q_θ) does not provide useful signal to the generator if the target and model distributions are not absolutely continuous, as is nearly always the case [2].

An alternative set of losses are the integral probability metrics (IPMs) [36], which can give credit to models Q_θ near to the target distribution P [3, 8, Section 4 of 15]. IPMs are defined in terms of a critic function: a well-behaved function with large amplitude where P and Q_θ differ most. The IPM is the difference in the expected critic under P and Q_θ, and is zero when the distributions agree. The Wasserstein IPMs, whose critics are made smooth via a Lipschitz constraint, have been particularly successful in IGMs [3, 14, 18]. But the Lipschitz constraint must hold uniformly, which can be hard to enforce. A popular approximation has been to apply a gradient constraint only in expectation [18]: the critic's gradient norm is constrained to be small on points chosen uniformly between P and Q. Another class of IPMs used as IGM losses are the Maximum Mean Discrepancies (MMDs) [17], as in [13, 28]. Here the critic function is a member of a reproducing kernel Hilbert space (except in [50], who learn a deep approximation to an RKHS critic).
Better performance can be obtained, however, when the MMD kernel is not based directly on image pixels, but on learned features of images. Wasserstein-inspired gradient regularization approaches can be used on the MMD critic when learning these features: [27] uses weight clipping [3], and [5, 7] use a gradient penalty [18]. The recent Sobolev GAN [33] uses a similar constraint on the expected gradient norm, but phrases it as estimating a Sobolev IPM rather than loosely approximating Wasserstein. This expectation can be taken over the same distribution as [18], but other measures are also proposed, such as (P + Q_θ)/2. A second recent approach, the spectrally normalized GAN [32], controls the Lipschitz constant of the critic by constraining the spectral norm of each weight matrix to be 1. Gradient penalties also benefit GANs based on f-divergences [37]: for instance, the spectral normalization technique of [32] can be applied to the critic network of an f-GAN. Alternatively, a gradient penalty can be defined to approximate the effect of blurring P and Q_θ with noise [40], which addresses the problem of non-overlapping support [2]. This approach has recently been shown to yield locally convergent optimization in some cases with non-continuous distributions, where the original GAN does not [30].

In this paper, we introduce a novel regularization for the MMD GAN critic of [5, 7, 27], which directly targets generator performance, rather than adopting regularization methods intended to approximate Wasserstein distances [3, 18]. The new MMD regularizer derives from an approach widely used in semi-supervised learning [10, Section 2], where the aim is to define a classification function f which is positive on P (the positive class) and negative on Q_θ (the negative class), in the absence of labels on many of the samples. The decision boundary between the classes is assumed to lie in a region of low density for both P and Q_θ: f should therefore be flat where P and Q_θ have support (areas of constant label), and have a larger slope in regions of low density. Bousquet et al. [10] propose as their regularizer on f a sum of the variance and a density-weighted gradient norm. We adopt a related penalty on the MMD critic, with the difference that we only apply the penalty on P: thus the critic is flatter where P has high mass, but does not vanish on the generator samples from Q_θ (which we optimize). In excluding Q_θ from the critic function constraint, we also avoid the concern raised by [32] that a critic depending on Q_θ will change with the current minibatch, potentially leading to less stable learning. The resulting discrepancy is no longer an integral probability metric: it is asymmetric, and the critic function class depends on the target P being approximated.

We first discuss in Section 2 how MMD-based losses can be used to learn implicit generative models, and how a naive approach can fail. This motivates our new discrepancies, introduced in Section 3. Section 4 demonstrates that these losses outperform state-of-the-art models for image generation.

2 Learning implicit generative models with MMD-based losses

An IGM is a model Q_θ which aims to approximate a target distribution P over a space X ⊆ R^d. We will define Q_θ by a generator function G_θ : Z → X, implemented as a deep network with parameters θ, where Z is a space of latent codes, say R^128. We assume a fixed distribution on Z, say Z ∼ Uniform([−1, 1]^128), and call Q_θ the distribution of G_θ(Z).
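Concretely, here is a minimal sketch of such a generator (assuming PyTorch; the toy fully-connected architecture and all names are ours, purely for illustration, not the networks used in the experiments below):

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """G_theta: latent codes Z in R^128 mapped to samples in X (toy fully-connected net)."""
    def __init__(self, latent_dim=128, data_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, data_dim), nn.Tanh(),  # outputs in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

G = Generator()
z = 2 * torch.rand(64, 128) - 1   # Z ~ Uniform([-1, 1]^128)
samples = G(z)                    # a batch from Q_theta
```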
We will consider learning by minimizing a discrepancy D between distributions, with D(P, Q_θ) ≥ 0 and D(P, P) = 0, which we call our loss. We aim to minimize D(P, Q_θ) with stochastic gradient descent on an estimator of D. In the present work, we build losses D based on the Maximum Mean Discrepancy,

MMD_k(P, Q) = sup_{f : ‖f‖_{H_k} ≤ 1} E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)],    (1)

an integral probability metric where the critic class is the unit ball within H_k, the reproducing kernel Hilbert space with kernel k. The optimization in (1) admits a simple closed-form optimal critic, f*(t) ∝ E_{X∼P}[k(X, t)] − E_{Y∼Q}[k(Y, t)]. There is also an unbiased, closed-form estimator of MMD²_k with appealing statistical properties [17]; in particular, its sample complexity is independent of the dimension of X, compared to the exponential dependence [52] of the Wasserstein distance

W(P, Q) = sup_{f : ‖f‖_Lip ≤ 1} E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)].    (2)

The MMD is continuous in the weak topology for any bounded kernel with Lipschitz embeddings [46, Theorem 3.2(b)], meaning that if P_n converges in distribution to P, P_n →_D P, then MMD(P_n, P) → 0. (W is continuous in the slightly stronger Wasserstein topology [51, Definition 6.9]; P_n →_W P implies P_n →_D P, and the two notions coincide if X is bounded.) Continuity means the loss can provide better signal to the generator as Q_θ approaches P, as opposed to e.g. the Jensen–Shannon divergence, where the loss can be constant until suddenly jumping to 0 [e.g. 3, Example 1]. The MMD is also strict, meaning it is zero iff P = Q_θ, for characteristic kernels [45]. The Gaussian kernel yields an MMD both continuous in the weak topology and strict. Thus in principle, one need not conduct any alternating optimization in an IGM at all, but merely choose generator parameters θ to minimize MMD_k.

Despite these appealing properties, using simple pixel-level kernels leads to poor generator samples [8, 13, 28, 48]. More recent MMD GANs [5, 7, 27] achieve better results by using a parameterized family of kernels {k_ψ}_{ψ∈Ψ} in the Optimized MMD loss previously studied by [44, 46]:

D^Ψ_MMD(P, Q) := sup_{ψ∈Ψ} MMD_{k_ψ}(P, Q).    (3)

We primarily consider kernels defined by some fixed kernel K on top of a learned low-dimensional representation φ_ψ : X → R^s, i.e. k_ψ(x, y) = K(φ_ψ(x), φ_ψ(y)), denoted k_ψ = K ∘ φ_ψ. In practice, K is a simple characteristic kernel, e.g. Gaussian, and φ_ψ is usually a deep network with output dimension say s = 16 [7] or even s = 1 (in our experiments). If φ_ψ is powerful enough, this choice is sufficient; we need not try to ensure each k_ψ is characteristic, as did [27].

Proposition 1. Suppose k = K ∘ φ_ψ, with K characteristic and {φ_ψ} rich enough that for any P ≠ Q, there is a ψ ∈ Ψ for which φ_ψ#P ≠ φ_ψ#Q.² Then if P ≠ Q, D^Ψ_MMD(P, Q) > 0.

Proof. Let ψ̂ ∈ Ψ be such that φ_ψ̂#P ≠ φ_ψ̂#Q. Then, since K is characteristic, D^Ψ_MMD(P, Q) = sup_{ψ∈Ψ} MMD_K(φ_ψ#P, φ_ψ#Q) ≥ MMD_K(φ_ψ̂#P, φ_ψ̂#Q) > 0.

To estimate D^Ψ_MMD, one can conduct alternating optimization to estimate a ψ̂ and then update the generator according to MMD_{k_ψ̂}, similar to the scheme used in GANs and WGANs. (This form of estimator is justified by an envelope theorem [31], although it is invariably biased [7].) Unlike D_GAN or W, fixing a ψ̂ and optimizing the generator still yields a sensible distance MMD_{k_ψ̂}. Early attempts at minimizing D^Ψ_MMD in an IGM, though, were unsuccessful [48, footnote 7]. This could be because for some kernel classes, D^Ψ_MMD is stronger than Wasserstein or MMD.
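To make the alternating scheme concrete, here is a minimal sketch (assuming PyTorch; the helper names are ours, not from the paper's released code) of the standard unbiased estimator of MMD²_k [17] with a learned-feature kernel k_ψ = K ∘ φ_ψ and a Gaussian top-level kernel K:

```python
import torch

def gaussian_K(a, b, bandwidth=1.0):
    """Top-level kernel K(a, b) = exp(-||a - b||^2 / (2 bandwidth^2)) on feature space."""
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd2_unbiased(phi, X, Y, bandwidth=1.0):
    """Unbiased estimator of MMD^2 for k_psi = K o phi, from samples X ~ P, Y ~ Q."""
    fX, fY = phi(X), phi(Y)
    if fX.dim() == 1:                      # allow a scalar-output critic (s = 1)
        fX, fY = fX[:, None], fY[:, None]
    Kxx = gaussian_K(fX, fX, bandwidth)
    Kyy = gaussian_K(fY, fY, bandwidth)
    Kxy = gaussian_K(fX, fY, bandwidth)
    m, n = X.shape[0], Y.shape[0]
    # Drop the diagonal (k(x_i, x_i) terms) in the within-sample sums for unbiasedness.
    term_xx = (Kxx.sum() - Kxx.diagonal().sum()) / (m * (m - 1))
    term_yy = (Kyy.sum() - Kyy.diagonal().sum()) / (n * (n - 1))
    return term_xx + term_yy - 2 * Kxy.mean()
```

Alternating optimization then ascends this estimate in the critic parameters ψ and descends it in the generator parameters θ.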
Example 1 (Dirac GAN [30]). We wish to model a point mass at the origin of R, P = δ_0, with any possible point mass, Q_θ = δ_θ for θ ∈ R. We use a Gaussian kernel of any bandwidth, which can be written as k_ψ = K ∘ φ_ψ with φ_ψ(x) = ψx for ψ ∈ Ψ = R and K(a, b) = exp(−(a − b)²/2). Then

MMD²_{k_ψ}(δ_0, δ_θ) = 2[1 − exp(−ψ²θ²/2)],    D^Ψ_MMD(δ_0, δ_θ) = √2 if θ ≠ 0, and 0 if θ = 0.

Considering D^Ψ_MMD(δ_0, δ_{1/n}) = √2 ↛ 0, even though δ_{1/n} →_W δ_0, shows that the Optimized MMD distance is not continuous in the weak or Wasserstein topologies. This also causes optimization issues. Figure 1 (a) shows gradient vector fields in parameter space, v(θ, ψ) ∝ (−∂_θ MMD²_{k_ψ}(δ_0, δ_θ), ∂_ψ MMD²_{k_ψ}(δ_0, δ_θ)). Some sequences following v (e.g. A) converge to an optimal solution (0, ψ), but some (B) move in the wrong direction, and others (C) are stuck because there is essentially no gradient. Figure 1 (c, red) shows that the optimal D^Ψ_MMD critic is very sharp near P and Q; this is less true for cases where the algorithm converged. We can avoid these issues if we ensure a bounded Lipschitz critic:³

Proposition 2. Assume the critics f_ψ(x) = (E_{X∼P} k_ψ(X, x) − E_{Y∼Q} k_ψ(Y, x)) / MMD_{k_ψ}(P, Q) are uniformly bounded and have a common Lipschitz constant: sup_{x∈X, ψ∈Ψ} |f_ψ(x)| < ∞ and sup_{ψ∈Ψ} ‖f_ψ‖_Lip < ∞. In particular, this holds when k_ψ = K ∘ φ_ψ and

sup_{a∈R^s} K(a, a) < ∞,    ‖K(a, ·) − K(b, ·)‖_{H_K} ≤ L_K ‖a − b‖_{R^s},    sup_{ψ∈Ψ} ‖φ_ψ‖_Lip ≤ L_φ < ∞.

Then D^Ψ_MMD is continuous in the weak topology: if P_n →_D P, then D^Ψ_MMD(P_n, P) → 0.

²f#P denotes the pushforward of a distribution: if X ∼ P, then f(X) ∼ f#P.
³[27, Theorem 4] makes a similar claim to Proposition 2, but its proof was incorrect: it tries to uniformly bound MMD_{k_ψ} by W₂, but the bound used is for a Wasserstein in terms of ‖k_ψ(x, ·) − k_ψ(y, ·)‖_{H_{k_ψ}}.

Figure 1: The setting of Example 1. (a, b): parameter-space gradient fields for the MMD and the SMMD (Section 3.3); the horizontal axis is θ, and the vertical 1/ψ. (c): optimal MMD critics for θ = 20 with different kernels. (d): the MMD and the distances of Section 3 (SMMD, GCMMD, LipMMD) optimized over ψ.

Proof. The main result is [12, Corollary 11.3.4]. To show the claim for k_ψ = K ∘ φ_ψ, note that |f_ψ(x) − f_ψ(y)| ≤ ‖f_ψ‖_{H_{k_ψ}} ‖k_ψ(x, ·) − k_ψ(y, ·)‖_{H_{k_ψ}}, which since ‖f_ψ‖_{H_{k_ψ}} = 1 is

‖K(φ_ψ(x), ·) − K(φ_ψ(y), ·)‖_{H_K} ≤ L_K ‖φ_ψ(x) − φ_ψ(y)‖_{R^s} ≤ L_K L_φ ‖x − y‖_{R^d}.

Indeed, if we put a box constraint on ψ [27] or regularize the gradient of the critic function [7], the resulting MMD GAN generally matches or outperforms WGAN-based models. Unfortunately, though, an additive gradient penalty doesn't substantially change the vector field of Figure 1 (a), as shown in Figure 5 (Appendix B). We will propose distances with much better convergence behavior.
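The closed forms in Example 1 are easy to check numerically. The sketch below (plain Python; the function names are ours) computes MMD²_{k_ψ}(δ_0, δ_θ) and a finite-difference version of the parameter-space field v(θ, ψ) visualized in Figure 1 (a):

```python
import math

def mmd2(theta, psi):
    """Closed-form MMD^2 between delta_0 and delta_theta for k_psi(x, y) = exp(-psi^2 (x - y)^2 / 2).

    MMD^2 = k(0, 0) + k(theta, theta) - 2 k(0, theta) = 2 (1 - exp(-psi^2 theta^2 / 2)).
    """
    return 2 * (1 - math.exp(-((psi * theta) ** 2) / 2))

def vector_field(theta, psi, eps=1e-6):
    """Descent direction in theta (generator), ascent direction in psi (kernel)."""
    d_theta = (mmd2(theta + eps, psi) - mmd2(theta - eps, psi)) / (2 * eps)
    d_psi = (mmd2(theta, psi + eps) - mmd2(theta, psi - eps)) / (2 * eps)
    return (-d_theta, d_psi)

# Far from the target with a narrow kernel, the field nearly vanishes (point C in Figure 1a):
print(vector_field(theta=10.0, psi=1.0))   # ~(0, 0): exp(-50) kills the gradient
print(vector_field(theta=0.5, psi=1.0))    # useful signal near the optimum
```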
3 New discrepancies for learning implicit generative models

Our aim here is to introduce a discrepancy that can provide useful gradient information when used as an IGM loss. Proofs of results in this section are deferred to Appendix A.

3.1 Lipschitz Maximum Mean Discrepancy

Proposition 2 shows that an MMD-like discrepancy can be continuous under the weak topology even when optimizing over kernels, if we directly restrict the critic functions to be Lipschitz. We can easily define such a distance, which we call the Lipschitz MMD: for some λ > 0,

LipMMD_{k,λ}(P, Q) := sup_{f ∈ H_k : ‖f‖²_Lip + λ‖f‖²_{H_k} ≤ 1} E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)].    (4)

For a universal kernel k, we conjecture that lim_{λ→0} LipMMD_{k,λ}(P, Q) = W(P, Q). But for any k and λ, LipMMD is upper-bounded by W, as (4) optimizes over a smaller set of functions than (2). Thus D^{Ψ,λ}_LipMMD(P, Q) := sup_{ψ∈Ψ} LipMMD_{k_ψ,λ}(P, Q) is also upper-bounded by W, and hence is continuous in the Wasserstein topology. It also shows excellent empirical behavior on Example 1 (Figure 1 (d), and Figure 5 in Appendix B). But estimating LipMMD_{k,λ}, let alone D^{Ψ,λ}_LipMMD, is in general extremely difficult (Appendix D), as finding ‖f‖_Lip requires optimization over the input space. Constraining the mean gradient rather than the maximum, as we do next, is far more tractable.

3.2 Gradient-Constrained Maximum Mean Discrepancy

We define the Gradient-Constrained MMD for λ > 0 and using some measure µ as

GCMMD_{µ,k,λ}(P, Q) := sup_{f ∈ H_k : ‖f‖_{S(µ),k,λ} ≤ 1} E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)],    (5)

where

‖f‖²_{S(µ),k,λ} := ‖f‖²_{L²(µ)} + ‖∇f‖²_{L²(µ)} + λ‖f‖²_{H_k}.    (6)

Here ‖g‖²_{L²(µ)} = ∫ ‖g(x)‖² µ(dx) denotes the squared L² norm. Rather than directly constraining the Lipschitz constant, the second term ‖∇f‖²_{L²(µ)} encourages the function f to be flat where µ has mass. In experiments we use µ = P, flattening the critic near the target sample. We add the first term following [10]: in one dimension and with µ uniform, ‖·‖_{S(µ),·,0} is then an RKHS norm with the kernel κ(x, y) = exp(−|x − y|), which is also a Sobolev space. The correspondence to a Sobolev norm is lost in higher dimensions [53, Ch. 10], but we also found the first term to be beneficial in practice.

We can exploit some properties of H_k to compute (5) analytically. Call the difference in kernel mean embeddings η := E_{X∼P}[k(X, ·)] − E_{Y∼Q}[k(Y, ·)] ∈ H_k; recall MMD(P, Q) = ‖η‖_{H_k}.

Proposition 3. Let µ̂ = (1/M) Σ_{m=1}^M δ_{X_m}. Define η(X) ∈ R^M with mth entry η(X_m), and ∇η(X) ∈ R^{Md} with (m, i)th entry⁴ ∂_i η(X_m). Then under Assumptions (A) to (D) in Appendix A.1,

GCMMD²_{µ̂,k,λ}(P, Q) = (1/λ) [ MMD²(P, Q) − P̄(η) ],
P̄(η) = [η(X); ∇η(X)]ᵀ ( λM I + [[K, Gᵀ], [G, H]] )⁻¹ [η(X); ∇η(X)],

where K is the kernel matrix K_{m,m′} = k(X_m, X_{m′}), G is the matrix of left derivatives⁵ G_{(m,i),m′} = ∂_i k(X_m, X_{m′}), and H that of derivatives of both arguments, H_{(m,i),(m′,j)} = ∂_i ∂_{j+d} k(X_m, X_{m′}).

⁴We use (m, i) to denote (m − 1)d + i; thus ∇η(X) stacks ∇η(X_1), . . . , ∇η(X_M) into one vector.
⁵We use ∂_i k(x, y) to denote the partial derivative with respect to x_i, and ∂_{i+d} k(x, y) that with respect to y_i.

As long as P and Q have integrable first moments, and µ has second moments, Assumptions (A) to (D) are satisfied e.g. by a Gaussian or linear kernel on top of a differentiable φ_ψ. We can thus estimate the GCMMD based on samples from P, Q, and µ by using the empirical mean η̂ for η. This discrepancy indeed works well in practice: Appendix F.2 shows that optimizing our estimate of D^{µ,Ψ,λ}_GCMMD = sup_{ψ∈Ψ} GCMMD_{µ,k_ψ,λ} yields a good generative model on MNIST. But the linear system of size M + Md is impractical: even on 28×28 images and using a low-rank approximation, the model took days to converge. We therefore design a less expensive discrepancy in the next section.

The GCMMD is related to some discrepancies previously used in IGM training. The Fisher GAN [34] uses only the variance constraint ‖f‖²_{L²(µ)} ≤ 1. The Sobolev GAN [33] constrains ‖∇f‖²_{L²(µ)} ≤ 1, along with a vanishing boundary condition on f to ensure a well-defined solution (although this was not used in the implementation, and can cause very unintuitive critic behavior; see Appendix C). The authors considered several choices of µ, including the WGAN-GP measure [18] and mixtures (P + Q_θ)/2. Rather than enforcing the constraints in closed form as we do, though, these models used additive regularization. We will compare to the Sobolev GAN in experiments.
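As an illustration of the closed form in Proposition 3, the sketch below (NumPy; all names are ours) assembles K, G, and H for a Gaussian kernel applied directly to the inputs, where the derivative entries have simple analytic forms, and solves the regularized (M + Md)-dimensional system; for k_ψ = K ∘ φ_ψ the entries would additionally involve Jacobians of φ_ψ. The gradient entries are stacked in the (m − 1)d + i order of footnote 4, and the system follows the form reconstructed above:

```python
import numpy as np

def gaussian_kernel_blocks(X):
    """K, G, H for k(x, y) = exp(-||x - y||^2 / 2) at points X of shape (M, d).

    K[m, m']           = k(X_m, X_m')
    G[(m, i), m']      = d/dx_i k(x, X_m') at x = X_m        = -(X_m - X_m')_i k
    H[(m, i), (m', j)] = d^2/(dx_i dy_j) k at (X_m, X_m')    = (delta_ij - diff_i diff_j) k
    """
    M, d = X.shape
    diff = X[:, None, :] - X[None, :, :]                 # (M, M, d)
    K = np.exp(-np.sum(diff ** 2, axis=2) / 2)           # (M, M)
    G = (-diff * K[:, :, None]).transpose(0, 2, 1).reshape(M * d, M)
    outer = diff[:, :, :, None] * diff[:, :, None, :]    # (M, M, d, d)
    H = ((np.eye(d) - outer) * K[:, :, None, None]).transpose(0, 2, 1, 3).reshape(M * d, M * d)
    return K, G, H

def gcmmd2(eta_vals, grad_eta_vals, mmd2, X, lam):
    """Squared GCMMD via Proposition 3. eta_vals (M,) and grad_eta_vals (M*d,)
    are empirical estimates of eta(X_m) and its gradients; mmd2 estimates MMD^2."""
    M, d = X.shape
    K, G, H = gaussian_kernel_blocks(X)
    D = np.block([[K, G.T], [G, H]])                     # (M + Md) x (M + Md)
    h = np.concatenate([eta_vals, grad_eta_vals])
    P_bar = h @ np.linalg.solve(lam * M * np.eye(M + M * d) + D, h)
    return (mmd2 - P_bar) / lam
```

The (M + Md)-sized solve is exactly the cost that motivates the cheaper discrepancy of the next section.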
3.3 Scaled Maximum Mean Discrepancy

We will now derive a lower bound on the Gradient-Constrained MMD which retains many of its attractive qualities but can be estimated in time linear in the dimension d.

Proposition 4. Make Assumptions (A) to (D). For any f ∈ H_k, ‖f‖_{S(µ),k,λ} ≤ σ⁻¹_{µ,k,λ} ‖f‖_{H_k}, where

σ_{µ,k,λ} := ( λ + ∫ k(x, x) µ(dx) + ∫ Σ_{i=1}^d ∂_i ∂_{i+d} k(y, z)|_{(y,z)=(x,x)} µ(dx) )^{−1/2}.

We then define the Scaled Maximum Mean Discrepancy based on this bound of Proposition 4:

SMMD_{µ,k,λ}(P, Q) := sup_{f : σ⁻¹_{µ,k,λ} ‖f‖_{H_k} ≤ 1} E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)] = σ_{µ,k,λ} MMD_k(P, Q).    (7)

Because the constraint in the optimization of (7) is more restrictive than that of (5), we have SMMD_{µ,k,λ}(P, Q) ≤ GCMMD_{µ,k,λ}(P, Q). The Sobolev norm ‖f‖_{S(µ),k,λ}, and a fortiori the gradient norm under µ, is thus also controlled for the SMMD critic. We also show in Appendix F.1 that SMMD_{µ,k,λ} behaves similarly to GCMMD_{µ,k,λ} on Gaussians. If k_ψ = K ∘ φ_ψ and K(a, b) = g(‖a − b‖²), then σ⁻²_{µ,k,λ} = λ + g(0) + 2|g′(0)| E_µ[‖∇φ_ψ(X)‖²_F]. Or if K is linear, K(a, b) = aᵀb, then σ⁻²_{µ,k,λ} = λ + E_µ[‖φ_ψ(X)‖² + ‖∇φ_ψ(X)‖²_F]. Estimating these terms based on samples from µ is straightforward, giving a natural estimator for the SMMD.
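The bound in Proposition 4 is a short Cauchy–Schwarz argument; a sketch, using the reproducing property f(x) = ⟨f, k(x, ·)⟩_{H_k} and its derivative form ∂_i f(x) = ⟨f, ∂_i k(x, ·)⟩_{H_k} [47]:

```latex
\|f\|_{L^2(\mu)}^2
  = \int \langle f, k(x,\cdot)\rangle_{\mathcal{H}_k}^2 \,\mu(dx)
  \le \|f\|_{\mathcal{H}_k}^2 \int k(x,x)\,\mu(dx),
\\
\|\nabla f\|_{L^2(\mu)}^2
  = \int \sum_{i=1}^d \langle f, \partial_i k(x,\cdot)\rangle_{\mathcal{H}_k}^2 \,\mu(dx)
  \le \|f\|_{\mathcal{H}_k}^2 \int \sum_{i=1}^d
      \partial_i \partial_{i+d} k(y,z)\big|_{(y,z)=(x,x)} \,\mu(dx).
```

Summing the two bounds with λ‖f‖²_{H_k} gives ‖f‖²_{S(µ),k,λ} ≤ σ⁻²_{µ,k,λ} ‖f‖²_{H_k}, the constant of Proposition 4; the second step uses ‖∂_i k(x, ·)‖²_{H_k} = ∂_i ∂_{i+d} k(x, x).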
Of course, if µ and k are fixed, the SMMD is simply a constant times the MMD, and so behaves in essentially the same way as the MMD. But optimizing the SMMD over a kernel family Ψ,

D^{µ,Ψ,λ}_SMMD(P, Q) := sup_{ψ∈Ψ} SMMD_{µ,k_ψ,λ}(P, Q),

gives a distance very different from D^Ψ_MMD (3). Figure 1 (b) shows the vector field for the Optimized SMMD loss in Example 1, using the WGAN-GP measure µ = Uniform(0, θ). The optimization surface is far more amenable: in particular the location C, which formerly had an extremely small gradient that made learning effectively impossible, now converges very quickly by first reducing the critic gradient until some signal is available. Figure 1 (d) demonstrates that D^{µ,Ψ,λ}_SMMD, like D^{µ,Ψ,λ}_GCMMD and D^{Ψ,λ}_LipMMD but in sharp contrast to D^Ψ_MMD, is continuous with respect to the location θ and provides a strong gradient towards 0.

We can establish that D^{µ,Ψ,λ}_SMMD is continuous in the Wasserstein topology under some conditions:

Theorem 1. Let k_ψ = K ∘ φ_ψ, with φ_ψ : X → R^s a fully-connected L-layer network with Leaky-ReLU_α activations whose layers do not increase in width, and K satisfying mild smoothness conditions with an associated constant Q_K < ∞ (Assumptions (II) to (V) in Appendix A.2). Let Ψ_κ be the set of parameters where each layer's weight matrix has condition number cond(W^l) = ‖W^l‖_op / σ_min(W^l) ≤ κ < ∞. If µ has a density (Assumption (I)), then

D^{µ,Ψ_κ,λ}_SMMD(P, Q) ≤ (Q_K κ^{L/2} √(dL) / α^{L/2}) W(P, Q).

Thus if P_n →_W P, then D^{µ,Ψ_κ,λ}_SMMD(P_n, P) → 0, even if µ is chosen to depend on P and Q.

Uniform bounds vs bounds in expectation Controlling ‖∇f_ψ‖²_{L²(µ)} = E_µ[‖∇f_ψ(X)‖²] does not necessarily imply a bound on ‖f‖_Lip = sup_{x∈X} ‖∇f_ψ(x)‖, and so does not in general give continuity via Proposition 2. Theorem 1 implies that when the network's weights are well-conditioned, it is sufficient to control only ‖∇f_ψ‖²_{L²(µ)}, which is far easier in practice than controlling ‖f‖_Lip. If we instead tried to directly control ‖f‖_Lip with e.g. spectral normalization (SN) [32], we could significantly reduce the expressiveness of the parametric family. In Example 1, constraining ‖φ_ψ‖_Lip = 1 limits us to only Ψ = {1}. Thus D^{{1}}_MMD is simply the MMD with an RBF kernel of bandwidth 1, which has poor gradients when θ is far from 0 (Figure 1 (c), blue). The Cauchy–Schwarz bound of Proposition 4 allows jointly adjusting the smoothness of k_ψ and the critic f, while SN must control the two independently. Relatedly, limiting ‖φ‖_Lip by limiting the Lipschitz norm of each layer could substantially reduce capacity, while ‖∇f_ψ‖_{L²(µ)} need not be decomposed by layer. Another advantage is that µ provides a data-dependent measure of complexity as in [10]: we do not needlessly prevent ourselves from using critics that behave poorly only far from the data.

Spectral parametrization When the generator is near a local optimum, the critic might identify only one direction on which Q_θ and P differ. If the generator parameterization is such that there is no local way for the generator to correct it, the critic may begin to single-mindedly focus on this difference, choosing redundant convolutional filters and causing the condition number of the weights to diverge. If this occurs, the generator will be motivated to fix this single direction while ignoring all other aspects of the distributions, after which it may become stuck. We can help avoid this collapse by using a critic parameterization that encourages diverse filters with higher-rank weight matrices. Miyato et al. [32] propose to parameterize the weight matrices as W = γ W̄ / ‖W̄‖_op, where ‖W̄‖_op is the spectral norm of W̄. This parametrization works particularly well with D^{µ,Ψ,λ}_SMMD; Figure 2 (b) shows the singular values of the second layer of a critic's network (and Figure 9, in Appendix F.3, shows more layers), while Figure 2 (d) shows the evolution of the condition number during training. The conditioning of the weight matrix remains stable throughout training with spectral parametrization, while it worsens through training in the default case.

4 Experiments

We evaluated unsupervised image generation on three datasets: CIFAR-10 [26] (60 000 images, 32×32), CelebA [29] (202 599 face images, resized and cropped to 160×160 as in [7]), and the more challenging ILSVRC2012 (ImageNet) dataset [41] (1 281 167 images, resized to 64×64). Code for all of these experiments is available at github.com/MichaelArbel/Scaled-MMD-GAN.

Losses All models are based on a scalar-output critic network φ_ψ : X → R, except MMDGAN-GP where φ_ψ : X → R^16 as in [7]. The WGAN and Sobolev GAN use a critic f = φ_ψ, while the GAN uses a discriminator D_ψ(x) = 1/(1 + exp(−φ_ψ(x))). The MMD-based methods use a kernel k_ψ(x, y) = exp(−(φ_ψ(x) − φ_ψ(y))²/2), except for MMDGAN-GP, which uses a mixture of RQ kernels as in [7]. Increasing the output dimension of the critic or using a different kernel didn't substantially change the performance of our proposed method. We also consider SMMD with a linear top-level kernel, k(x, y) = φ_ψ(x)φ_ψ(y); because this becomes essentially identical to a WGAN (Appendix E), we refer to this method as SWGAN. SMMD and SWGAN use µ = P; Sobolev GAN uses µ = (P + Q)/2 as in [33]. We choose λ and an overall scaling to obtain the losses (a sketch of the corresponding training code follows below):

SMMD: (unbiased estimate of MMD²_{k_ψ}(P, Q_θ)) / (1 + 10 E_{P̂}[‖∇φ_ψ(X)‖²_F]),

SWGAN: (E_{P̂}[φ_ψ(X)] − E_{Q̂_θ}[φ_ψ(X)]) / √(1 + 10 E_{P̂}[|φ_ψ(X)|²] + E_{P̂}[‖∇φ_ψ(X)‖²_F]).
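A minimal sketch of these two training losses (assuming PyTorch and a scalar-output critic phi, reusing mmd2_unbiased from the Section 2 sketch; the helper names are ours, and the placement of the factor 10 follows the displayed formulas):

```python
import torch

def grad_norm2(phi, X):
    """Estimate E_mu ||grad_x phi(x)||_F^2 with mu = P-hat, via autograd.

    For a scalar-output critic, summing phi(X) gives per-sample input gradients;
    create_graph=True lets the critic update backprop through the scale term.
    """
    X = X.clone().requires_grad_(True)
    (g,) = torch.autograd.grad(phi(X).sum(), X, create_graph=True)
    return g.flatten(1).pow(2).sum(dim=1).mean()

def smmd_loss(phi, X, Y):
    """Scaled MMD objective: unbiased MMD^2 divided by the scale estimate (mu = P)."""
    return mmd2_unbiased(phi, X, Y) / (1 + 10 * grad_norm2(phi, X))

def swgan_loss(phi, X, Y):
    """SWGAN objective (linear top-level kernel): scaled difference of critic means."""
    scale = torch.sqrt(1 + 10 * phi(X).pow(2).mean() + grad_norm2(phi, X))
    return (phi(X).mean() - phi(Y).mean()) / scale
```

The critic ascends these objectives in ψ; the generator descends them in θ, with X a minibatch from P and Y = G_θ(z) a minibatch from Q_θ.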
Architecture For CIFAR-10, we used the CNN architecture proposed by [32] with a 7-layer critic and a 4-layer generator. For CelebA, we used a 5-layer DCGAN discriminator and a 10-layer ResNet generator as in [7]. For ImageNet, we used a 10-layer ResNet for both the generator and discriminator. In all experiments we used 64 filters for the smallest convolutional layer, and doubled this at each layer (CelebA/ImageNet) or every other layer (CIFAR-10). The input codes for the generator are drawn from Uniform([−1, 1]^128). We consider two parameterizations for each critic: a standard one where the parameters can take any real value, and a spectral parametrization (denoted SN-) as above [32]. Models without explicit gradient control (SN-GAN, SN-MMDGAN, SN-MMDGAN-L2, SN-WGAN) fix γ = 1, giving spectral normalization; the others learn γ, giving a spectral parameterization.

Training All models were trained for 150 000 generator updates on a single GPU, except for ImageNet, where the model was trained on 3 GPUs simultaneously. To limit communication overhead we averaged the MMD estimate on each GPU, giving the block MMD estimator [54]. We always used 64 samples per GPU from each of P and Q, and 5 critic updates per generator step. We used initial learning rates of 0.0001 for CIFAR-10 and CelebA, 0.0002 for ImageNet, and decayed these rates using the KID adaptive scheme of [7]: every 2 000 steps, generator samples are compared to those from 20 000 steps ago, and if the relative KID test [9] fails to show an improvement three consecutive times, the learning rate is decayed by 0.8. We used the Adam optimizer [25] with β1 = 0.5, β2 = 0.9.
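This adaptive decay is easy to express as a small training-loop hook. The sketch below is ours (a simplified reading of the scheme of [7]); the relative KID test itself, from [9], is abstracted as a boolean passed in by the caller:

```python
class KIDDecay:
    """Three-strikes learning-rate decay driven by the relative KID test (sketch)."""

    def __init__(self, optimizer, decay=0.8, patience=3):
        self.optimizer, self.decay, self.patience = optimizer, decay, patience
        self.failures = 0

    def step(self, relative_kid_improved):
        """Call every 2000 generator steps with the outcome of the relative KID
        test [9], comparing current samples to those from 20000 steps ago."""
        if relative_kid_improved:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.patience:
            for group in self.optimizer.param_groups:  # PyTorch-style optimizer
                group["lr"] *= self.decay
            self.failures = 0
```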
Evaluation To compare the sample quality of different models, we considered three different scores based on the Inception network [49] trained for ImageNet classification, all using the default parameters in the implementation of [7]. The Inception Score (IS) [42] is based on the entropy of predicted labels; higher values are better. Though standard, this metric has many issues, particularly on datasets other than ImageNet [4, 7, 20]. The FID [20] instead measures the similarity of samples from the generator and the target as the Wasserstein-2 distance between Gaussians fit to their intermediate representations. It is more sensible than the IS and is becoming standard, but its estimator is strongly biased [7]. The KID [7] is similar to FID, but by using a polynomial-kernel MMD its estimates enjoy better statistical properties and are easier to compare. (A similar score was recommended by [21].)

Results Table 1a presents the scores for models trained on both the CIFAR-10 and CelebA datasets. On CIFAR-10, SN-SWGAN and SN-SMMDGAN performed comparably to SN-GAN. But on CelebA, SN-SWGAN and SN-SMMDGAN dramatically outperformed the other methods with the same architecture in all three metrics. They also trained faster, and consistently outperformed other methods over multiple initializations (Figure 2 (a)). It is worth noting that SN-SWGAN far outperformed WGAN-GP on both datasets. Table 1b presents the scores for SMMDGAN and SN-SMMDGAN trained on ImageNet, and the scores of pre-trained models using BGAN [6] and SN-GAN [32].⁶ The proposed methods substantially outperformed both methods in FID and KID scores. Figure 3 shows samples on ImageNet and CelebA; Appendix F.4 has more.

⁶These models are courtesy of the respective authors and also trained at 64×64 resolution. SN-GAN used the same architecture as our model, but trained for 250 000 generator iterations; BGAN used a similar 5-layer ResNet architecture and trained for 74 epochs, comparable to SN-GAN.

Figure 2: The training process on CelebA. (a) KID scores. We report a final score for SN-GAN slightly before its sudden failure mode; MMDGAN and SN-MMDGAN were unstable and had scores around 100. (b) Singular values of the second layer, both early (dashed) and late (solid) in training. (c) σ⁻²_{µ,k,λ} (critic complexity) for several MMD-based methods. (d) The condition number in the first layer through training. SN alone does not control σ_{µ,k,λ}, and SMMD alone does not control the condition number.

Figure 3: Samples from various models. Top: 64×64 ImageNet; bottom: 160×160 CelebA. Panels: (a) Scaled MMD GAN with SN; (c) Boundary Seeking GAN; (d) Scaled MMD GAN with SN; (e) Scaled WGAN with SN; (f) MMD GAN with GP+L2.

Table 1: Mean (standard deviation) of score estimates, based on 50 000 samples from each model.

(a) CIFAR-10 and CelebA.

| Method | IS (CIFAR-10) | FID (CIFAR-10) | KID×10³ (CIFAR-10) | IS (CelebA) | FID (CelebA) | KID×10³ (CelebA) |
|---|---|---|---|---|---|---|
| WGAN-GP | 6.9 (0.2) | 31.1 (0.2) | 22.2 (1.1) | 2.7 (0.0) | 29.2 (0.2) | 22.0 (1.0) |
| MMDGAN-GP-L2 | 6.9 (0.1) | 31.4 (0.3) | 23.3 (1.1) | 2.6 (0.0) | 20.5 (0.2) | 13.0 (1.0) |
| Sobolev-GAN | 7.0 (0.1) | 30.3 (0.3) | 22.3 (1.2) | 2.9 (0.0) | 16.4 (0.1) | 10.6 (0.5) |
| SMMDGAN | 7.0 (0.1) | 31.5 (0.4) | 22.2 (1.1) | 2.7 (0.0) | 18.4 (0.2) | 11.5 (0.8) |
| SN-GAN | 7.2 (0.1) | 26.7 (0.2) | 16.1 (0.9) | 2.7 (0.0) | 22.6 (0.1) | 14.6 (1.1) |
| SN-SWGAN | 7.2 (0.1) | 28.5 (0.2) | 17.6 (1.1) | 2.8 (0.0) | 14.1 (0.2) | 7.7 (0.5) |
| SN-SMMDGAN | 7.3 (0.1) | 25.0 (0.3) | 16.6 (2.0) | 2.8 (0.0) | 12.4 (0.2) | 6.1 (0.4) |

(b) ImageNet.

| Method | IS | FID | KID×10³ |
|---|---|---|---|
| BGAN | 10.7 (0.4) | 43.9 (0.3) | 47.0 (1.1) |
| SN-GAN | 11.2 (0.1) | 47.5 (0.1) | 44.4 (2.2) |
| SMMDGAN | 10.7 (0.2) | 38.4 (0.3) | 39.3 (2.5) |
| SN-SMMDGAN | 10.9 (0.1) | 36.6 (0.2) | 34.6 (1.6) |

Spectrally normalized WGANs / MMDGANs To control for the contribution of the spectral parametrization to the performance, we evaluated variants of MMDGANs, WGANs and Sobolev GAN using spectral normalization (in Table 2, Appendix F.3). WGAN and Sobolev-GAN led to unstable training and didn't converge at all (Figure 11) despite many attempts. MMDGAN converged on CIFAR-10 (Figure 11) but was unstable on CelebA (Figure 10). The gradient control due to SN is thus probably too loose for these methods. This is reinforced by Figure 2 (c), which shows that the expected gradient of the critic network is much better controlled by SMMD, even when SN is used. We also considered variants of these models with a learned γ while also adding a gradient penalty and an L2 penalty on critic activations [7, footnote 19]. These generally behaved similarly to MMDGAN, and didn't lead to substantial improvements. We ran the same experiments on CelebA, but aborted the runs early when it became clear that training was not successful.
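For concreteness, here is a minimal sketch (PyTorch; ours, not the released implementation) of the spectral parametrization W = γ W̄ / ‖W̄‖_op discussed in Section 3.3: fixing γ = 1 recovers spectral normalization [32], while the SN-SMMD models learn γ.

```python
import torch
import torch.nn as nn

class SpectralParamLinear(nn.Module):
    """Linear layer with weight W = gamma * W_bar / ||W_bar||_op (sketch)."""

    def __init__(self, in_features, out_features, learn_gamma=True):
        super().__init__()
        self.W_bar = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # learn_gamma=False fixes gamma = 1, i.e. plain spectral normalization.
        self.gamma = nn.Parameter(torch.ones(1), requires_grad=learn_gamma)

    def forward(self, x):
        # Exact spectral norm for clarity; in practice one power-iteration step
        # per update is the usual cheap approximation [32].
        sigma_max = torch.linalg.matrix_norm(self.W_bar, ord=2)
        W = self.gamma * self.W_bar / sigma_max
        return nn.functional.linear(x, W, self.bias)
```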
Rank collapse We occasionally observed the failure mode for SMMD discussed in Section 3.3, in which the critic becomes low-rank, especially on CelebA; this failure was obvious even in the training objective. Figure 2 (b) is one such example. Spectral parametrization seemed to prevent this behavior. We also found one could avoid collapse by reverting to an earlier checkpoint and increasing the RKHS regularization parameter λ, but did not do this for any of the experiments here.

5 Conclusion

We studied gradient regularization for MMD-based critics in implicit generative models, clarifying how previous techniques relate to the D^Ψ_MMD loss. Based on these insights, we proposed the Gradient-Constrained MMD and its approximation, the Scaled MMD, a new loss function for IGMs that controls gradient behavior in a principled way and obtains excellent performance in practice. One interesting area of future study for these distances is their behavior when used to diffuse particles distributed as Q towards particles distributed as P. Mroueh et al. [33, Appendix A.1] began such a study for the Sobolev GAN loss; [35] proved convergence and studied discrete-time approximations. Another area to explore is the geometry of these losses, as studied by Bottou et al. [8], who showed potential advantages of the Wasserstein geometry over the MMD. Their results, though, do not address any distances based on optimized kernels; the new distances introduced here might have interesting geometry of their own.

References

[1] B. Amos and J. Z. Kolter. OptNet: Differentiable Optimization as a Layer in Neural Networks. ICML, 2017. arXiv:1703.00443.
[2] M. Arjovsky and L. Bottou. Towards Principled Methods for Training Generative Adversarial Networks. ICLR, 2017. arXiv:1701.04862.
[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein Generative Adversarial Networks. ICML, 2017. arXiv:1701.07875.
[4] S. Barratt and R. Sharma. A Note on the Inception Score. 2018. arXiv:1801.01973.
[5] M. G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, and R. Munos. The Cramer Distance as a Solution to Biased Wasserstein Gradients. 2017. arXiv:1705.10743.
[6] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary Equilibrium Generative Adversarial Networks. 2017. arXiv:1703.10717.
[7] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. ICLR, 2018. arXiv:1801.01401.
[8] L. Bottou, M. Arjovsky, D. Lopez-Paz, and M. Oquab. Geometrical Insights for Implicit Generative Modeling. In Braverman Readings in Machine Learning: Key Ideas from Inception to Current State, ed. L. Rozonoer, B. Mirkin, and I. Muchnik, LNAI Vol. 11100, Springer, 2018, pp. 229–268. arXiv:1712.07822.
[9] W. Bounliphone, E. Belilovsky, M. B. Blaschko, I. Antonoglou, and A. Gretton. A Test of Relative Similarity For Model Selection in Generative Models. ICLR, 2016. arXiv:1511.04581.
[10] O. Bousquet, O. Chapelle, and M. Hein. Measure Based Regularization. NIPS, 2004.
[11] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural Photo Editing with Introspective Adversarial Networks. ICLR, 2017. arXiv:1609.07093.
[12] R. M. Dudley. Real Analysis and Probability. 2nd ed. Cambridge University Press, 2002.
[13] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via Maximum Mean Discrepancy optimization. UAI, 2015. arXiv:1505.03906.
[14] A. Genevay, G. Peyré, and M. Cuturi. Learning Generative Models with Sinkhorn Divergences. AISTATS, 2018. arXiv:1706.00292.
[15] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. JASA 102.477 (2007), pp. 359–378.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. NIPS, 2014. arXiv:1406.2661.
[17] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A Kernel Two-Sample Test. JMLR 13 (2012).
[18] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved Training of Wasserstein GANs. NIPS, 2017. arXiv:1704.00028.
[19] A. Güngör. Some bounds for the product of singular values. International Journal of Contemporary Mathematical Sciences (2007).
[20] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NIPS, 2017. arXiv:1706.08500.
[21] G. Huang, Y. Yuan, Q. Xu, C. Guo, Y. Sun, F. Wu, and K. Weinberger. An empirical study on evaluation metrics of generative adversarial networks. 2018. arXiv:1806.07755.
[22] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal Unsupervised Image-to-Image Translation. ECCV, 2018. arXiv:1804.04732.
[23] Y. Jin, K. Zhang, M. Li, Y. Tian, H. Zhu, and Z. Fang. Towards the Automatic Anime Characters Creation with Generative Adversarial Networks. 2017. arXiv:1708.05509.
[24] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR, 2018. arXiv:1710.10196.
[25] D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015. arXiv:1412.6980.
[26] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009.
[27] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. MMD GAN: Towards Deeper Understanding of Moment Matching Network. NIPS, 2017. arXiv:1705.08584.
[28] Y. Li, K. Swersky, and R. Zemel. Generative Moment Matching Networks. ICML, 2015. arXiv:1502.02761.
[29] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. ICCV, 2015. arXiv:1411.7766.
[30] L. Mescheder, A. Geiger, and S. Nowozin. Which Training Methods for GANs do actually Converge? ICML, 2018. arXiv:1801.04406.
[31] P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica 70.2 (2002), pp. 583–601.
[32] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral Normalization for Generative Adversarial Networks. ICLR, 2018. arXiv:1802.05957.
[33] Y. Mroueh, C.-L. Li, T. Sercu, A. Raj, and Y. Cheng. Sobolev GAN. ICLR, 2018. arXiv:1711.04894.
[34] Y. Mroueh and T. Sercu. Fisher GAN. NIPS, 2017. arXiv:1705.09675.
[35] Y. Mroueh, T. Sercu, and A. Raj. Regularized Kernel and Neural Sobolev Descent: Dynamic MMD Transport. 2018. arXiv:1805.12062.
[36] A. Müller. Integral Probability Metrics and their Generating Classes of Functions. Advances in Applied Probability 29.2 (1997), pp. 429–443.
[37] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. NIPS, 2016. arXiv:1606.00709.
[38] A. Radford, L. Metz, and S. Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR, 2016. arXiv:1511.06434.
[39] J. R. Retherford. Review: J. Diestel and J. J. Uhl, Jr., Vector measures. Bull. Amer. Math. Soc. 84.4 (July 1978), pp. 681–685.
[40] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing Training of Generative Adversarial Networks through Regularization. NIPS, 2017. arXiv:1705.09367.
[41] O. Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. 2014. arXiv:1409.0575.
[42] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved Techniques for Training GANs. NIPS, 2016. arXiv:1606.03498.
[43] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[44] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, G. R. G. Lanckriet, and B. Schölkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. NIPS, 2009.
[45] B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, Characteristic Kernels and RKHS Embedding of Measures. JMLR 12 (2011), pp. 2389–2410. arXiv:1003.0887.
[46] B. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli 22.3 (2016), pp. 1839–1893. arXiv:1310.8240.
[47] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[48] D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton. Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy. ICLR, 2017. arXiv:1611.04488.
[49] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. CVPR, 2016. arXiv:1512.00567.
[50] T. Unterthiner, B. Nessler, C. Seward, G. Klambauer, M. Heusel, H. Ramsauer, and S. Hochreiter. Coulomb GANs: Provably Optimal Nash Equilibria via Potential Fields. ICLR, 2018. arXiv:1708.08819.
[51] C. Villani. Optimal Transport: Old and New. Springer, 2009.
[52] J. Weed and F. Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli (forthcoming). arXiv:1707.00087.
[53] H. Wendland. Scattered Data Approximation. Cambridge University Press, 2005.
[54] W. Zaremba, A. Gretton, and M. B. Blaschko. B-tests: Low Variance Kernel Two-Sample Tests. NIPS, 2013. arXiv:1307.1954.