# improving_mmdgan_training_with_repulsive_loss_function__e8ecc05b.pdf

Published as a conference paper at ICLR 2019

IMPROVING MMD-GAN TRAINING WITH REPULSIVE LOSS FUNCTION

Wei Wang (University of Melbourne), Yuan Sun (RMIT University), Saman Halgamuge (University of Melbourne)

Generative adversarial nets (GANs) are widely used to learn the data sampling process, and their performance may heavily depend on the loss function, given a limited computational budget. This study revisits MMD-GAN, which uses the maximum mean discrepancy (MMD) as the loss function for GAN, and makes two contributions. First, we argue that the existing MMD loss function may discourage the learning of fine details in data, as it attempts to contract the discriminator outputs of real data. To address this issue, we propose a repulsive loss function that actively learns the differences among the real data by simply rearranging the terms in MMD. Second, inspired by the hinge loss, we propose a bounded Gaussian kernel to stabilize the training of MMD-GAN with the repulsive loss function. The proposed methods are applied to unsupervised image generation tasks on the CIFAR-10, STL-10, CelebA, and LSUN bedroom datasets. Results show that the repulsive loss function significantly improves over the MMD loss at no additional computational cost and outperforms other representative loss functions. The proposed methods achieve an FID score of 16.21 on the CIFAR-10 dataset using a single DCGAN network and spectral normalization.[1]

1 INTRODUCTION

Generative adversarial nets (GANs) (Goodfellow et al. (2014)) are a branch of generative models that learn to mimic the real data generating process. GANs have been intensively studied in recent years, with a variety of successful applications (Karras et al. (2018); Li et al. (2017b); Lai et al. (2017); Zhu et al. (2017); Ho & Ermon (2016)). The idea of GANs is to jointly train a generator network that attempts to produce artificial samples and a discriminator network, or critic, that distinguishes the generated samples from the real ones. Compared to maximum likelihood based methods, GANs tend to produce samples with sharper and more vivid details, but they require more effort to train.

Recent studies on improving GAN training have mainly focused on designing loss functions, network architectures and training procedures. The loss function, or simply loss, quantifies the difference between the discriminator outputs for real and generated samples. The gradients of the loss function are used to train the generator and discriminator. This study focuses on a loss function called maximum mean discrepancy (MMD), which is well known as a distance metric between two probability distributions and is widely applied in kernel two-sample tests (Gretton et al. (2012)). Theoretically, MMD reaches its global minimum of zero if and only if the two distributions are equal. Thus, MMD has been applied to compare generated samples to real ones directly (Li et al. (2015); Dziugaite et al. (2015)) and has recently been extended as a loss function for the GAN framework (Unterthiner et al. (2018); Li et al. (2017a); Bińkowski et al. (2018)).

In this paper, we interpret the optimization of the MMD loss by the discriminator as a combination of attraction and repulsion processes, similar to that of linear discriminant analysis. We argue that the existing MMD loss may discourage the learning of fine details in data, as the discriminator attempts to minimize the within-group variance of its outputs for the real data.
To address this issue, we propose a repulsive loss for the discriminator that explicitly explores the differences among real data. The proposed loss achieved significant improvements over the MMD loss on image generation tasks on four benchmark datasets, without incurring any additional computational cost. Furthermore, a bounded Gaussian kernel is proposed to stabilize the training of the discriminator. As such, using a single kernel in MMD-GAN is sufficient, in contrast to the linear combination of kernels used in Li et al. (2017a) and Bińkowski et al. (2018). By using a single kernel, the computational cost of the MMD loss can potentially be reduced in a variety of applications.

The paper is organized as follows. Section 2 reviews GANs trained using the MMD loss (MMD-GAN). We propose the repulsive loss for the discriminator in Section 3, introduce two practical techniques to stabilize the training process in Section 4, and present the results of extensive experiments in Section 5. In the last section, we discuss the connections between our model and existing work.

Corresponding author: weiw8@student.unimelb.edu.au
[1] The code is available at: https://github.com/richardwth/MMD-GAN

2 MMD-GAN

In this section, we introduce the GAN model and MMD loss. Consider a random variable $X \in \mathcal{X}$ with an empirical data distribution $P_X$ to be learned. A typical GAN model consists of two neural networks: a generator $G$ and a discriminator $D$. The generator $G$ maps a latent code $z$ with a fixed distribution $P_Z$ (e.g., Gaussian) to the data space $\mathcal{X}$: $y = G(z) \in \mathcal{X}$, where $y$ represents the generated samples with distribution $P_G$. The discriminator $D$ evaluates the scores $D(a) \in \mathbb{R}^d$ of a real or generated sample $a$. This study focuses on image generation tasks using convolutional neural networks (CNN) for both $G$ and $D$.

Several loss functions have been proposed to quantify the difference of the scores between real and generated samples, $\{D(x)\}$ and $\{D(y)\}$, including the minimax loss and non-saturating loss (Goodfellow et al. (2014)), hinge loss (Tran et al. (2017)), Wasserstein loss (Arjovsky et al. (2017); Gulrajani et al. (2017)) and maximum mean discrepancy (MMD) (Li et al. (2017a); Bińkowski et al. (2018)) (see Appendix B.1 for more details). Among them, MMD uses the kernel embedding $\phi(a) = k(\cdot, a)$ associated with a characteristic kernel $k$ such that $\phi$ is infinite-dimensional and $\langle \phi(a), \phi(b) \rangle_{\mathcal{H}} = k(a, b)$. The squared MMD distance between two distributions $P$ and $Q$ is

$$M_k^2(P, Q) = \| \mu_P - \mu_Q \|_{\mathcal{H}}^2 = \mathbb{E}_{a, a' \sim P}[k(a, a')] + \mathbb{E}_{b, b' \sim Q}[k(b, b')] - 2\,\mathbb{E}_{a \sim P, b \sim Q}[k(a, b)] \quad (1)$$

The kernel $k(a, b)$ measures the similarity between two samples $a$ and $b$. Gretton et al. (2012) proved that, using a characteristic kernel $k$, $M_k^2(P, Q) \geq 0$ with equality if and only if $P = Q$.

In MMD-GAN, the discriminator $D$ can be interpreted as forming a new kernel with $k$: $k \circ D(a, b) = k(D(a), D(b)) = k_D(a, b)$. If $D$ is injective, $k_D$ is characteristic and $M_{k_D}^2(P_X, P_G)$ reaches its minimum if and only if $P_X = P_G$ (Li et al. (2017a)). Thus, the objective functions for $G$ and $D$ could be (Li et al. (2017a); Bińkowski et al. (2018)):

$$\min_G\; L_G^{\mathrm{mmd}} = M_{k_D}^2(P_X, P_G) = \mathbb{E}_{P_G}[k_D(y, y')] - 2\,\mathbb{E}_{P_X, P_G}[k_D(x, y)] + \mathbb{E}_{P_X}[k_D(x, x')] \quad (2)$$

$$\min_D\; L_D^{\mathrm{att}} = -M_{k_D}^2(P_X, P_G) = 2\,\mathbb{E}_{P_X, P_G}[k_D(x, y)] - \mathbb{E}_{P_X}[k_D(x, x')] - \mathbb{E}_{P_G}[k_D(y, y')] \quad (3)$$

MMD-GAN has been shown to be more effective than the model that directly uses MMD as the loss function for the generator $G$ (Li et al. (2017a)).
Liu et al. (2017) showed that MMD and the Wasserstein metric are weaker objective functions for GAN than the Jensen-Shannon (JS) divergence (related to the minimax loss) and the total variation (TV) distance (related to the hinge loss). The reason is that convergence of $P_G$ to $P_X$ in JS divergence and TV distance also implies convergence in MMD and the Wasserstein metric. Weak metrics are desirable as they provide more information on adjusting the model to fit the data distribution (Liu et al. (2017)). Nagarajan & Kolter (2017) proved that a GAN trained using the minimax loss and gradient updates on model parameters is locally exponentially stable near equilibrium, while a GAN using the Wasserstein loss is not. In Appendix A, we demonstrate that MMD-GAN trained by gradient descent is locally exponentially stable near equilibrium.

3 REPULSIVE LOSS FUNCTION

In this section, we interpret the training of MMD-GAN (using $L_D^{\mathrm{att}}$ and $L_G^{\mathrm{mmd}}$) as a combination of attraction and repulsion processes, and propose a novel repulsive loss function for the discriminator by rearranging the components in $L_D^{\mathrm{att}}$.

[Figure 1: Illustration of the gradient directions of each loss on the real sample scores $\{D(x)\}$ ("r" nodes) and generated sample scores $\{D(y)\}$ ("g" nodes): (a) $L_D^{\mathrm{att}}$ (Eq. 3); (b) $L_D^{\mathrm{rep}}$ (Eq. 4); (c) $L_G^{\mathrm{mmd}}$ paired with $L_D^{\mathrm{rep}}$. The blue arrows stand for attraction and the orange arrows for repulsion. When $L_G^{\mathrm{mmd}}$ is paired with $L_D^{\mathrm{att}}$, the gradient directions of $L_G^{\mathrm{mmd}}$ on $\{D(y)\}$ can be obtained by reversing the arrows in (a) and are thus omitted.]

First, consider a linear discriminant analysis (LDA) model as the discriminator. The task is to find a projection $w$ to maximize the between-group variance $\| w^T \mu_x - w^T \mu_y \|$ and minimize the within-group variance $w^T(\Sigma_x + \Sigma_y)w$, where $\mu$ and $\Sigma$ are the group mean and covariance. In MMD-GAN, the neural-network discriminator works in a similar way as LDA. By minimizing $L_D^{\mathrm{att}}$, the discriminator $D$ tackles two tasks: 1) $D$ reduces $\mathbb{E}_{P_X, P_G}[k_D(x, y)]$, i.e., causes the two groups $\{D(x)\}$ and $\{D(y)\}$ to repel each other (see Fig. 1a, orange arrows), or maximizes the between-group variance; and 2) $D$ increases $\mathbb{E}_{P_X}[k_D(x, x')]$ and $\mathbb{E}_{P_G}[k_D(y, y')]$, i.e., contracts $\{D(x)\}$ and $\{D(y)\}$ within each group (see Fig. 1a, blue arrows), or minimizes the within-group variance. We refer to loss functions that contract real data scores as attractive losses.

We argue that the attractive loss $L_D^{\mathrm{att}}$ (Eq. 3) has two issues that may slow down GAN training:

1. The discriminator $D$ may focus more on the similarities among real samples (in order to contract $\{D(x)\}$) than on the fine details that separate them. Initially, $G$ produces low-quality samples and it may be adequate for $D$ to learn the common features of $\{x\}$ in order to distinguish between $\{x\}$ and $\{y\}$. Only when $\{D(y)\}$ is sufficiently close to $\{D(x)\}$ will $D$ learn the fine details of $\{x\}$ to be able to separate $\{D(x)\}$ from $\{D(y)\}$. Consequently, $D$ may leave out some fine details in real samples, so $G$ has no access to them during training.

2. As shown in Fig. 1a, the gradients on $D(y)$ from the attraction (blue arrows) and repulsion (orange arrows) terms in $L_D^{\mathrm{att}}$ (and thus $L_G^{\mathrm{mmd}}$) may have opposite directions during training. Their sum may be small in magnitude even when $D(y)$ is far away from $D(x)$, which may cause $G$ to stagnate locally.

Therefore, we propose a repulsive loss for $D$ to encourage repulsion of the real data scores $\{D(x)\}$:

$$L_D^{\mathrm{rep}} = \mathbb{E}_{P_X}[k_D(x, x')] - \mathbb{E}_{P_G}[k_D(y, y')] \quad (4)$$

The generator $G$ uses the same MMD loss $L_G^{\mathrm{mmd}}$ as before (see Eq. 2).
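To make the three losses concrete, the sketch below is a minimal PyTorch illustration we add for readability (the released implementation is in TensorFlow and may differ in details). It computes $L_G^{\mathrm{mmd}}$ (Eq. 2), $L_D^{\mathrm{att}}$ (Eq. 3) and $L_D^{\mathrm{rep}}$ (Eq. 4) from a batch of discriminator outputs, using a single Gaussian kernel and the unbiased estimator that averages over the i != j pairs.

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    # a, b: (n, d) and (m, d) discriminator outputs; returns the n x m kernel matrix
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2.0 * sigma ** 2))

def off_diag_mean(k):
    # unbiased within-group estimate: average over the i != j entries of a square matrix
    n = k.size(0)
    return (k.sum() - k.diagonal().sum()) / (n * (n - 1))

def mmd_losses(d_real, d_fake, sigma=1.0):
    # d_real = D(x), d_fake = D(G(z)): (batch, d) score matrices
    k_xx = gaussian_kernel(d_real, d_real, sigma)
    k_yy = gaussian_kernel(d_fake, d_fake, sigma)
    k_xy = gaussian_kernel(d_real, d_fake, sigma)
    loss_g_mmd = off_diag_mean(k_yy) - 2.0 * k_xy.mean() + off_diag_mean(k_xx)  # Eq. (2)
    loss_d_att = -loss_g_mmd                                                    # Eq. (3)
    loss_d_rep = off_diag_mean(k_xx) - off_diag_mean(k_yy)                      # Eq. (4)
    return loss_g_mmd, loss_d_att, loss_d_rep
```

Note that switching from $L_D^{\mathrm{att}}$ to $L_D^{\mathrm{rep}}$ only drops the cross term and flips the signs of the within-group terms, which is why the repulsive loss adds no computational cost.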
Thus, the adversarial game lies in the fact that $D$ contracts $\{D(y)\}$ via maximizing $\mathbb{E}_{P_G}[k_D(y, y')]$ (see Fig. 1b) while $G$ expands $\{D(y)\}$ (see Fig. 1c). Additionally, $D$ also learns to separate the real data by minimizing $\mathbb{E}_{P_X}[k_D(x, x')]$, which actively explores the fine details in real samples and may result in more meaningful gradients for $G$. Note that in Eq. 4, $D$ does not explicitly push the average score of $\{D(y)\}$ away from that of $\{D(x)\}$, because doing so may have no effect on the pairwise sample distances. But $G$ aims to match the average scores of both groups. Thus, we believe that, compared to the model using $L_G^{\mathrm{mmd}}$ and $L_D^{\mathrm{att}}$, our model using $L_G^{\mathrm{mmd}}$ and $L_D^{\mathrm{rep}}$ is less likely to yield opposite gradients when $\{D(y)\}$ and $\{D(x)\}$ are distinct (see Fig. 1c). In Appendix A, we demonstrate that a GAN trained using gradient descent and the repulsive MMD loss ($L_D^{\mathrm{rep}}$, $L_G^{\mathrm{mmd}}$) is locally exponentially stable near equilibrium.

At last, we identify a general form of loss function for the discriminator $D$:

$$L_{D,\lambda} = \lambda\,\mathbb{E}_{P_X}[k_D(x, x')] - (\lambda - 1)\,\mathbb{E}_{P_X, P_G}[k_D(x, y)] - \mathbb{E}_{P_G}[k_D(y, y')] \quad (5)$$

where $\lambda$ is a hyper-parameter.[2] When $\lambda < 0$, the discriminator loss $L_{D,\lambda}$ is attractive, with $\lambda = -1$ corresponding to the original MMD loss $L_D^{\mathrm{att}}$ in Eq. 3; when $\lambda > 0$, $L_{D,\lambda}$ is repulsive and $\lambda = 1$ corresponds to $L_D^{\mathrm{rep}}$ in Eq. 4. It is interesting that when $\lambda > 1$, the discriminator explicitly contracts $\{D(x)\}$ and $\{D(y)\}$ via maximizing $\mathbb{E}_{P_X, P_G}[k_D(x, y)]$, which may work as a penalty that prevents the pairwise distances of $\{D(x)\}$ from increasing too fast. Note that $L_{D,\lambda}$ has the same computational cost as $L_D^{\mathrm{att}}$ (Eq. 3), as we have only rearranged the terms in $L_D^{\mathrm{att}}$.

[Figure 2: (a) Gaussian kernels $\{k^{\mathrm{rbf}}_{\sigma_i}(a, b)\}$ and their mean as a function of $e = \|a - b\|$, where $\sigma_i \in \{1, 2, 4\}$ were used in our experiments; (b) derivatives of $\{k^{\mathrm{rbf}}_{\sigma_i}(a, b)\}$ in (a); (c) rational quadratic kernels $\{k^{\mathrm{rq}}_{\alpha_i}(a, b)\}$ and their mean, where $\alpha_i \in \{0.2, 0.5, 1, 2, 5\}$; (d) derivatives of $\{k^{\mathrm{rq}}_{\alpha_i}(a, b)\}$ in (c).]

4 REGULARIZATION ON MMD AND DISCRIMINATOR

In this section, we propose two approaches to stabilize the training of MMD-GAN: 1) a bounded kernel to avoid the saturation issue caused by an over-confident discriminator; and 2) a generalized power iteration method to estimate the spectral norm of a convolutional kernel, which was used in spectral normalization on the discriminator in all experiments in this study unless specified otherwise.

4.1 KERNEL IN MMD

For MMD-GAN, the following two kernels have been used:

- Gaussian radial basis function (RBF), or Gaussian kernel (Li et al. (2017a)), $k^{\mathrm{rbf}}_\sigma(a, b) = \exp(-\frac{1}{2\sigma^2}\|a - b\|^2)$, where $\sigma > 0$ is the kernel scale or bandwidth.
- Rational quadratic kernel (Bińkowski et al. (2018)), $k^{\mathrm{rq}}_\alpha(a, b) = (1 + \frac{1}{2\alpha}\|a - b\|^2)^{-\alpha}$, where the kernel scale $\alpha > 0$ corresponds to a mixture of Gaussian kernels with a $\mathrm{Gamma}(\alpha, 1)$ prior on the inverse kernel scales $\sigma^{-1}$.

It is interesting that both studies used a linear combination of kernels with five different kernel scales, i.e., $k^{\mathrm{rbf}} = \sum_{i=1}^{5} k^{\mathrm{rbf}}_{\sigma_i}$ and $k^{\mathrm{rq}} = \sum_{i=1}^{5} k^{\mathrm{rq}}_{\alpha_i}$, where $\sigma_i \in \{1, 2, 4, 8, 16\}$ and $\alpha_i \in \{0.2, 0.5, 1, 2, 5\}$ (see Fig. 2a and 2c for illustration). We suspect the reason is that a single kernel $k(a, b)$ saturates when the distance $\|a - b\|$ is either too large or too small compared to the kernel scale (see Fig. 2b and 2d), which may cause diminishing gradients during training. Both Li et al. (2017a) and Bińkowski et al. (2018) applied penalties on the discriminator parameters but not on the MMD loss itself.
Thus the saturation issue may still exist. Using a linear combination of kernels with different kernel scales may alleviate this issue but does not eradicate it.

Inspired by the hinge loss (see Appendix B.1), we propose a bounded RBF (RBF-B) kernel for the discriminator. The idea is to prevent $D$ from pushing $\{D(x)\}$ too far away from $\{D(y)\}$ and causing saturation. For $L_D^{\mathrm{att}}$ in Eq. 3, the RBF-B kernel is:

$$k^{\mathrm{rbf\text{-}b}}_\sigma(a, b) = \begin{cases} \exp\!\big(-\frac{1}{2\sigma^2}\max(\|a - b\|^2, b_l)\big) & \text{if } a, b \in \{D(x)\} \text{ or } a, b \in \{D(y)\} \\ \exp\!\big(-\frac{1}{2\sigma^2}\min(\|a - b\|^2, b_u)\big) & \text{if } a \in \{D(x)\} \text{ and } b \in \{D(y)\} \end{cases} \quad (6)$$

For $L_D^{\mathrm{rep}}$ in Eq. 4, the RBF-B kernel is:

$$k^{\mathrm{rbf\text{-}b}}_\sigma(a, b) = \begin{cases} \exp\!\big(-\frac{1}{2\sigma^2}\max(\|a - b\|^2, b_l)\big) & \text{if } a, b \in \{D(y)\} \\ \exp\!\big(-\frac{1}{2\sigma^2}\min(\|a - b\|^2, b_u)\big) & \text{if } a, b \in \{D(x)\} \end{cases} \quad (7)$$

where $b_l$ and $b_u$ are the lower and upper bounds. As such, a single kernel is sufficient, and we set $\sigma = 1$, $b_l = 0.25$ and $b_u = 4$ in all experiments for simplicity, leaving their tuning for future work. It should be noted that, as in the case of the hinge loss, the RBF-B kernel is used only for the discriminator, to prevent it from being over-confident. The generator is always trained using the original RBF kernel, so we retain the interpretation of the MMD loss $L_G^{\mathrm{mmd}}$ as a metric.

The RBF-B kernel is one of many possible methods to address the saturation issue and stabilize MMD-GAN training. We found that randomly sampling the kernel scale, instance noise (Sønderby et al. (2017)) and label smoothing (Szegedy et al. (2016); Salimans et al. (2016)) may also improve the model performance and stability. However, the computational cost of the RBF-B kernel is relatively low.

[2] The weights for the three terms in $L_{D,\lambda}$ sum to zero. This is to make sure that $\partial L_{D,\lambda} / \partial \theta_D$ is zero at the equilibrium $P_X = P_G$, where $\theta_D$ denotes the parameters of $D$.

4.2 SPECTRAL NORMALIZATION IN DISCRIMINATOR

Without any Lipschitz constraint, the discriminator $D$ may simply increase the magnitude of its outputs to minimize the discriminator loss, causing unstable training.[3] Spectral normalization divides the weight matrix of each layer by its spectral norm, which imposes an upper bound on the magnitudes of outputs and gradients at each layer of $D$ (Miyato et al. (2018)). However, to estimate the spectral norm of a convolution kernel, Miyato et al. (2018) reshaped the kernel into a matrix. We propose a generalized power iteration method to directly estimate the spectral norm of a convolution kernel (see Appendix C for details) and applied spectral normalization to the discriminator in all experiments. In Appendix D.1, we explore using gradient penalty to impose the Lipschitz constraint (Gulrajani et al. (2017); Bińkowski et al. (2018); Arbel et al. (2018)) for the proposed repulsive loss.

5 EXPERIMENTS

In this section, we empirically evaluate: 1) the proposed repulsive loss $L_D^{\mathrm{rep}}$ (Eq. 4) on unsupervised training of GANs for image generation tasks; and 2) the RBF-B kernel for stabilizing MMD-GAN training. The generalized power iteration method is evaluated in Appendix C.3. To show the efficacy of $L_D^{\mathrm{rep}}$, we compared the loss functions ($L_D^{\mathrm{rep}}$, $L_G^{\mathrm{mmd}}$) using the Gaussian kernel (MMD-rep) with ($L_D^{\mathrm{att}}$, $L_G^{\mathrm{mmd}}$) using the Gaussian kernel (MMD-rbf) (Li et al. (2017a)) and the rational quadratic kernel (MMD-rq) (Bińkowski et al. (2018)), as well as the non-saturating loss (Goodfellow et al. (2014)) and hinge loss (Tran et al. (2017)). To show the efficacy of the RBF-B kernel, we applied it to both $L_D^{\mathrm{att}}$ and $L_D^{\mathrm{rep}}$, resulting in two methods: MMD-rbf-b and MMD-rep-b.
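To illustrate how MMD-rbf-b and MMD-rep-b differ from their unbounded counterparts, the following sketch (again a PyTorch illustration we add, following Eqs. 6 and 7 with the default σ = 1, b_l = 0.25, b_u = 4; not the released code) clamps the squared distances before applying the Gaussian kernel, so the discriminator receives no gradient from pairs that have already been contracted below b_l or pushed apart beyond b_u:

```python
import torch

def off_diag_mean(k):
    # average over the i != j entries of a square kernel matrix
    n = k.size(0)
    return (k.sum() - k.diagonal().sum()) / (n * (n - 1))

def bounded_mmd_losses(d_real, d_fake, sigma=1.0, b_l=0.25, b_u=4.0):
    c = 2.0 * sigma ** 2
    dxx = torch.cdist(d_real, d_real).pow(2)   # real-real squared distances
    dyy = torch.cdist(d_fake, d_fake).pow(2)   # fake-fake squared distances
    dxy = torch.cdist(d_real, d_fake).pow(2)   # real-fake squared distances
    # Eq. (7): kernel for the repulsive loss (MMD-rep-b). D expands real pairs,
    # so their distance is capped at b_u; D contracts fake pairs, so their
    # distance is floored at b_l (the max/min in Eqs. 6-7 become clamp here).
    k_xx_rep = torch.exp(-torch.clamp(dxx, max=b_u) / c)
    k_yy_rep = torch.exp(-torch.clamp(dyy, min=b_l) / c)
    loss_d_rep_b = off_diag_mean(k_xx_rep) - off_diag_mean(k_yy_rep)
    # Eq. (6): kernel for the attractive loss (MMD-rbf-b). Within-group distances
    # are floored at b_l, between-group distances are capped at b_u.
    k_xx_att = torch.exp(-torch.clamp(dxx, min=b_l) / c)
    k_yy_att = torch.exp(-torch.clamp(dyy, min=b_l) / c)
    k_xy_att = torch.exp(-torch.clamp(dxy, max=b_u) / c)
    loss_d_att_b = 2.0 * k_xy_att.mean() - off_diag_mean(k_xx_att) - off_diag_mean(k_yy_att)
    return loss_d_att_b, loss_d_rep_b
```

In both methods the generator keeps the unbounded Gaussian kernel of Eq. 2.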
The Wasserstein loss was excluded from the comparison because it cannot be directly used with spectral normalization (Miyato et al. (2018)).

5.1 EXPERIMENT SETUP

Dataset: The loss functions were evaluated on four datasets: 1) CIFAR-10 (50K images, 32x32 pixels) (Krizhevsky & Hinton (2009)); 2) STL-10 (100K images, 48x48 pixels) (Coates et al. (2011)); 3) CelebA (about 203K images, 64x64 pixels) (Liu et al. (2015)); and 4) LSUN bedrooms (around 3 million images, 64x64 pixels) (Yu et al. (2015)). The images were scaled to the range [-1, 1] to avoid numeric issues.

Network architecture: The DCGAN (Radford et al. (2016)) architecture was used with hyper-parameters from Miyato et al. (2018) (see Appendix B.2 for details). In all experiments, batch normalization (BN) (Ioffe & Szegedy (2015)) was used in the generator, and spectral normalization with the generalized power iteration (see Appendix C) in the discriminator. For the MMD-related losses, the dimension of the discriminator output layer was set to 16; for the non-saturating loss and hinge loss, it was 1. In Appendix D.2, we investigate the impact of the discriminator output dimension on the performance of the repulsive loss.

[3] Note that training stability is different from the local stability considered in Appendix A. Training stability often refers to the ability of a model to converge to a desired state measured by some criterion. Local stability means that if a model is initialized sufficiently close to an equilibrium, it will converge to the equilibrium.

Table 1: Inception score (IS), Fréchet Inception distance (FID) and multi-scale structural similarity (MS-SSIM) on image generation tasks using different loss functions.

| Methods[a]     | CIFAR-10 IS | CIFAR-10 FID | STL-10 IS | STL-10 FID | CelebA[b] FID | CelebA[b] MS-SSIM | LSUN-bedroom[b] FID | LSUN-bedroom[b] MS-SSIM |
|----------------|-------------|--------------|-----------|------------|---------------|-------------------|---------------------|-------------------------|
| Real data      | 11.31       | 2.09         | 26.37     | 2.10       | 1.09          | 0.2678            | 1.24                | 0.0915                  |
| Non-saturating | 7.39        | 23.23        | 8.25      | 48.53      | 10.64         | 0.2895            | 23.66               | 0.1027                  |
| Hinge          | 7.33        | 23.46        | 8.24      | 49.44      | 8.60          | 0.2894            | 16.73               | 0.0946                  |
| MMD-rbf[c]     | 7.05        | 28.38        | 8.13      | 57.52      | 13.03         | 0.2937            | -                   | -                       |
| MMD-rq[c]      | 7.22        | 27.00        | 8.11      | 54.05      | 12.74         | 0.2935            | -                   | -                       |
| MMD-rbf-b      | 7.18        | 25.25        | 8.07      | 51.86      | 10.09         | 0.3090            | 32.29               | 0.1001                  |
| MMD-rep        | 7.99        | 16.65        | 9.36      | 36.67      | 7.20          | 0.2761            | 16.91               | 0.0901                  |
| MMD-rep-b      | 8.29        | 16.21        | 9.34      | 37.63      | 6.79          | 0.2659            | 12.52               | 0.0908                  |

[a] The models here differ only in the loss functions and the dimension of discriminator outputs. See Section 5.1.
[b] For CelebA and LSUN-bedroom, IS is not meaningful (Bińkowski et al. (2018)) and is thus omitted.
[c] On LSUN-bedroom, MMD-rbf and MMD-rq did not achieve reasonable results and are thus omitted.

Hyper-parameters: We used the Adam optimizer (Kingma & Ba (2015)) with momentum parameters β1 = 0.5, β2 = 0.999; the two-timescale update rule (TTUR) (Heusel et al. (2017)) with two learning rates (ρD, ρG) chosen from {1e-4, 2e-4, 5e-4, 1e-3} (16 combinations in total); and batch size 64. Fine-tuning the learning rates may improve the model performance, but constant learning rates were used for simplicity. All models were trained for 100K iterations on the CIFAR-10, STL-10, CelebA and LSUN bedroom datasets, with n_dis = 1, i.e., one discriminator update per generator update.[4] For MMD-rbf, the kernel scales σi ∈ {1, 2, 4} were used due to better performance than the original values used in Li et al. (2017a). For MMD-rq, αi ∈ {0.2, 0.5, 1, 2, 5}. For MMD-rbf-b, MMD-rep and MMD-rep-b, a single Gaussian kernel with σ = 1 was used.

Evaluation metrics: Inception score (IS) (Salimans et al. (2016)), Fréchet Inception distance (FID) (Heusel et al. (2017)) and multi-scale structural similarity (MS-SSIM) (Wang et al. (2003)) were used for quantitative evaluation.
Both IS and FID are calculated using a pre-trained Inception model (Szegedy et al. (2016)). Higher IS and lower FID scores indicate better image quality. MS-SSIM calculates the pair-wise image similarity and is used to detect mode collapse among images of the same class (Odena et al. (2017)). Lower MS-SSIM values indicate perceptually more diverse images. For each model, 50K randomly generated samples and 50K real samples were used to calculate IS, FID and MS-SSIM.

5.2 QUANTITATIVE ANALYSIS

Table 1 shows the Inception score, FID and MS-SSIM obtained by applying the different loss functions on the benchmark datasets, with the optimal learning rate combinations determined experimentally. Note that the same training setup (i.e., DCGAN + BN + SN + TTUR) was applied for each loss function. We observed that: 1) MMD-rep and MMD-rep-b performed significantly better than MMD-rbf and MMD-rbf-b respectively, showing that the proposed repulsive loss $L_D^{\mathrm{rep}}$ (Eq. 4) greatly improved over the attractive loss $L_D^{\mathrm{att}}$ (Eq. 3); 2) using a single kernel, MMD-rbf-b performed better than MMD-rbf and MMD-rq, which used a linear combination of five kernels, indicating that kernel saturation may be an issue that slows down MMD-GAN training; 3) MMD-rep-b performed comparably to or better than MMD-rep on the benchmark datasets, where we found the RBF-B kernel managed to stabilize MMD-GAN training using the repulsive loss; 4) MMD-rep and MMD-rep-b performed significantly better than the non-saturating and hinge losses, showing the efficacy of the proposed repulsive loss.

Additionally, we trained MMD-GAN using the general loss $L_{D,\lambda}$ (Eq. 5) for the discriminator and $L_G^{\mathrm{mmd}}$ (Eq. 2) for the generator on the CIFAR-10 dataset. Fig. 3 shows the influence of $\lambda$ on the performance of MMD-GAN with the RBF and RBF-B kernels.[5] Note that when $\lambda = -1$, the models are essentially MMD-rbf (with a single Gaussian kernel) and MMD-rbf-b when the RBF and RBF-B kernel are used, respectively.

[4] Increasing or decreasing n_dis may improve the model performance in some cases, but it has a significant impact on the computational cost. For a simple and fair comparison, we set n_dis = 1 in all cases.

[Figure 3: FID scores of MMD-GAN using (a) the RBF kernel and (b) the RBF-B kernel in $L_{D,\lambda}$ on the CIFAR-10 dataset for 16 learning rate combinations. Each color bar represents the FID score using a learning rate combination (ρD, ρG), in the order (1e-4, 1e-4), (1e-4, 2e-4), ..., (1e-3, 1e-3). The discriminator was trained using $L_{D,\lambda}$ (Eq. 5) with λ ∈ {-1, -0.5, 0, 0.5, 1, 2}, and the generator using $L_G^{\mathrm{mmd}}$ (Eq. 2). We use FID > 30 to indicate that the model diverged or produced poor results.]
We observed that: 1) the model performed well using the repulsive loss (i.e., λ ≥ 0), with λ = 0.5, 1 slightly better than λ = -0.5, 0, 2; 2) the MMD-rbf model can be significantly improved by simply increasing λ from -1 to -0.5, which reduces the attraction of the discriminator on real sample scores; 3) larger λ may lead to more diverged models, possibly because the discriminator focuses more on expanding the real sample scores than on adversarial learning; note that when λ ≫ 1, the model would simply learn to expand all real sample scores and pull the generated sample scores towards the real ones, which is a divergent process; 4) the RBF-B kernel managed to stabilize MMD-rep for most diverged cases, but may occasionally cause the FID score to rise.

[5] For λ < 0, the RBF-B kernel in Eq. 6 was used in $L_{D,\lambda}$.

The proposed methods were further evaluated in Appendices A, C and D. In Appendix A.2, we used a simulation study to show the local stability of MMD-rep trained by gradient descent, while its global stability is not guaranteed, as bad initialization may lead to trivial solutions. This problem may be alleviated by adjusting the learning rate for the generator. In Appendix C.3, we showed that the proposed generalized power iteration (Section 4.2) imposes a stronger Lipschitz constraint than the method in Miyato et al. (2018), and benefited MMD-GAN training using the repulsive loss. Moreover, the RBF-B kernel managed to stabilize MMD-GAN training for various configurations of the spectral normalization method. In Appendix D.1, we showed that gradient penalty can also be used with the repulsive loss. In Appendix D.2, we showed that it was better to use more than one neuron at the discriminator output layer for the repulsive loss.

5.3 QUALITATIVE ANALYSIS

The discriminator outputs may be interpreted as a learned representation of the input samples. Fig. 4 visualizes the discriminator outputs learned by the MMD-rbf and the proposed MMD-rep methods on the CIFAR-10 dataset using t-SNE (van der Maaten (2014)). MMD-rbf ignored the class structure in the data (see Fig. 4a), while MMD-rep learned to concentrate the data from the same class and separate different classes to some extent (Fig. 4b). This is because the discriminator D has to actively learn the data structure in order to expand the real sample scores {D(x)}. Thus, we speculate that techniques reinforcing the learning of cluster structures in data may further improve the training of MMD-GAN.

[Figure 4: t-SNE visualization of discriminator outputs {D(x)} learned by (a) MMD-rbf and (b) MMD-rep for 2560 real samples from the CIFAR-10 dataset, colored by their class labels.]

In addition, the performance gain of the proposed repulsive loss (Eq. 4) over the attractive loss (Eq. 3) comes at no additional computational cost. In fact, by using a single kernel rather than a linear combination of kernels, MMD-rep and MMD-rep-b are simpler than MMD-rbf and MMD-rq. Besides, given a typically small batch size and a small number of discriminator output neurons (64 and 16 in our experiments), the cost of MMD over the non-saturating and hinge losses is marginal compared to the convolution operations. In Appendix D.3, we provide some random samples generated by the methods in our study.

6 DISCUSSION

This study extends the previous work on MMD-GAN (Li et al. (2017a)) with two contributions.
First, we interpreted the optimization of MMD loss as a combination of attraction and repulsion processes, and proposed a repulsive loss for the discriminator that actively learns the difference among real data. Second, we proposed a bounded Gaussian RBF (RBF-B) kernel to address the saturation issue. Empirically, we observed that the repulsive loss may result in unstable training, due to factors including initialization (Appendix A.2), learning rate (Fig. 3b) and Lipschitz constraints on the discriminator (Appendix C.3). The RBF-B kernel managed to stabilize the MMD-GAN training in many cases. Tuning the hyper-parameters in RBF-B kernel or using other regularization methods may further improve our results. The theoretical advantages of MMD-GAN require the discriminator to be injective. The proposed repulsive loss (Eq. 4) attempts to realize this by explicitly maximizing the pair-wise distances among the real samples. Li et al. (2017a) achieved the injection property by using the discriminator as the encoder and an auxiliary network as the decoder to reconstruct the real and generated samples, which is more computationally extensive than our proposed approach. On the other hand, Bi nkowski et al. (2018); Arbel et al. (2018) imposed a Lipschitz constraint on the discriminator in MMD-GAN via gradient penalty, which may not necessarily promote an injective discriminator. The idea of repulsion on real sample scores is in line with existing studies. It has been widely accepted that the quality of generated samples can be significantly improved by integrating labels (Odena et al. (2017); Miyato & Koyama (2018); Zhou et al. (2018)) or even pseudo-labels generated by k-means method (Grinblat et al. (2017)) in the training of discriminator. The reason may be that the labels help concentrate the data from the same class and separate those from different classes. Using a pre-trained classifier may also help produce vivid image samples (Huang et al. (2017)) as the learned representations of the real samples in the hidden layers of the classifier tend to be well separated/organized and may produce more meaningful gradients to the generator. Published as a conference paper at ICLR 2019 At last, we note that the proposed repulsive loss is orthogonal to the GAN studies on designing network structures and training procedures, and thus may be combined with a variety of novel techniques. For example, the Res Net architecture (He et al. (2016)) has been reported to outperform the plain DCGAN used in our experiments on image generation tasks (Miyato et al. (2018); Gulrajani et al. (2017)) and self-attention module may further improve the results (Zhang et al. (2018)). On the other hand, Karras et al. (2018) proposed to progressively grows the size of both discriminator and generator and achieved the state-of-the-art performance on unsupervised training of GANs on the CIFAR-10 dataset. Future work may explore these directions. ACKNOWLEDGMENTS Wei Wang is fully supported by the Ph.D. scholarships of The University of Melbourne. This work is partially funded by Australian Research Council grant DP150103512 and undertaken using the LIEF HPC-GPGPU Facility hosted at the University of Melbourne. The Facility was established with the assistance of LIEF Grant LE170100200. Michael Arbel, Dougal J. Sutherland, Mikołaj Bi nkowski, and Arthur Gretton. On gradient regularizers for MMD GANs. In NIPS, 2018. Martin Arjovsky, Soumith Chintala, and L eon Bottou. Wasserstein generative adversarial networks. 
In ICML, volume 70 of PMLR, pp. 214 223, 2017. Mikołaj Bi nkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018. Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, volume 15 of PMLR, pp. 215 223, 2011. Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning, 2016. arxiv:1603.07285. Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, pp. 258 267, 2015. Marc G. Genton. Classes of kernels for machine learning: A statistics perspective. J. Mach. Learn. Res., 2:299 312, 2002. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pp. 2672 2680, 2014. Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch olkopf, and Alexander Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723 773, 2012. Guillermo L. Grinblat, Lucas C. Uzal, and Pablo M. Granitto. Class-splitting generative adversarial networks, 2017. ar Xiv:1709.07359. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In NIPS, pp. 5767 5777, 2017. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pp. 770 778, 2016. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pp. 6626 6637, 2017. Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, pp. 4565 4573, 2016. Xun Huang, Yixuan Li, Omid Poursaeed, John E. Hopcroft, and Serge J. Belongie. Stacked generative adversarial networks. CVPR, pp. 1866 1875, 2017. Published as a conference paper at ICLR 2019 Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448 456, 2015. Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018. Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR, 2015. Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Master s thesis, Department of Computer Science, University of Toronto, 2009. Wei-Sheng Lai, Jia-Bin Huang, and Ming-Hsuan Yang. Semi-supervised learning for optical flow with generative adversarial networks. In NIPS, pp. 354 364, 2017. Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos. MMD GAN: Towards deeper understanding of moment matching network. In NIPS, pp. 2203 2213, 2017a. Jiwei Li, Will Monroe, Tianlin Shi, S ebastien Jean, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. In EMNLP, pp. 2157 2169, 2017b. Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In ICML, volume 37, pp. 1718 1727, 2015. Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In NIPS, pp. 5545 5553, 2017. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, pp. 3730 3738, 2015. Takeru Miyato and Masanori Koyama. 
c GANs with projection discriminator. In ICLR, 2018. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018. Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In NIPS, pp. 5585 5595, 2017. Xuan Long Nguyen, Martin J. Wainwright, and Michael I. Jordan. On surrogate loss functions and f-divergences. Ann. Stat., 37(2):876 904, 04 2009. Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, pp. 2642 2651, 2017. Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, pp. 2234 2242, 2016. Hanie Sedghi, Vineet Gupta, and Philip M. Long. The singular values of convolutional layers. In ICLR, 2019. Casper K. Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Husz ar. Amortised map inference for image super-resolution. In ICLR, 2017. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pp. 2818 2826, 2016. Dustin Tran, Rajesh Ranganath, and David M. Blei. Hierarchical implicit models and likelihood-free variational inference, 2017. ar Xiv:1702.08896. Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: scalable certification of perturbation invariance for deep neural networks. In NIPS, 2018. Published as a conference paper at ICLR 2019 Thomas Unterthiner, Bernhard Nessler, Calvin Seward, G unter Klambauer, Martin Heusel, Hubert Ramsauer, and Sepp Hochreiter. Coulomb GANs: Provably optimal Nash equilibria via potential fields. In ICLR, 2018. Laurens van der Maaten. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res., 15: 3221 3245, 2014. Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In NIPS, 2018. Zhou Wang, Eero. P. Simoncelli, and Alan. C. Bovik. Multiscale structural similarity for image quality assessment. In Asilomar Conference on Signals, Systems & Computers, volume 2, pp. 1398 1402, 2003. Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. 2015. ar Xiv:1506.03365. Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks, 2018. ar Xiv:1805.08318. Zhiming Zhou, Han Cai, Shu Rong, Yuxuan Song, Kan Ren, Weinan Zhang, Jun Wang, and Yong Yu. Activation maximization generative adversarial nets. In ICLR, 2018. Jun-Yan. Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pp. 2242 2251, 2017. Published as a conference paper at ICLR 2019 A STABILITY ANALYSIS OF MMD-GAN This section demonstrates that, under mild assumptions, MMD-GAN trained by gradient descent is locally exponentially stable at equilibrium. It is organized as follows. The main assumption and proposition are presented in Section A.1, followed by simulation study in Section A.2 and proof in Section A.3. We discuss the indications of assumptions on the discriminator of GAN in Section A.4. 
A.1 MAIN PROPOSITION We consider GAN trained using the MMD loss Lmmd G for generator G and either the attractive loss Latt D or repulsive loss Lrep D for discriminator D, listed below: Lmmd G = M 2 k D(PX, PG) = EPX[k D(x, x )] 2EPX,PG[k D(x, y)] + EPG[k D(y, y )] (S1a) Latt D = Lmmd G (S1b) Lrep D = EPX[k D(x, x )] EPG[k D(y, y )] (S1c) where k D(a, b) = k(D(a), D(b)). Let S(P) be the support of distribution P; let θG ΘG, θD ΘD be the parameters of the generator G and discriminator D respectively. To prove that GANs trained using the minimax loss and gradient updates is locally stable at the equilibrium point (θ D, θ G), Nagarajan & Kolter (2017) made the following assumption: Assumption 1 (Nagarajan & Kolter (2017)). Pθ G = PX and x S(PX), Dθ D(x) = 0. For loss functions like minimax and Wasserstein, DθD(x) may be interpreted as how plausible a sample is real. Thus at equilibrium, it may be reasonable to assume all real and generated samples are equally plausible. However, Dθ D(x) = 0 also indicates that Dθ D may have no discrimination power (see Appendix A.4 for discussion). For MMD loss in Eq. S1, DθD(x)|x P may be interpreted as a learned representation of the distribution P. As long as two distributions P and Q match, M 2 k DθD (P, Q) = 0. On the other hand, DθD(x) = 0 is a minima solution for D but D is trained to find local maxima. Thus in contrast to Assumption 1, we assume Assumption 2. For GANs using MMD loss in Eq. S1, and random initialization on parameters, at equilibrium, Dθ D(x) is injective on S(PX) S S(Pθ G). Assumption 2 indicates that Dθ D(x) is not constant almost everywhere. We use a simulation study in Section A.2 to show that Dθ D(x) = 0 does not hold in general for MMD loss. Based on Assumption 2, we propose the following proposition and prove it in Appendix A.3: Proposition 1. If there exists θ G ΘG such that Pθ G = PX, then GANs with MMD loss in Eq. S1 has equilibria (θ G, θD) for any θD ΘD. Moreover, the model trained using gradient descent methods is locally exponentially stable at (θ G, θD) for any θD ΘD. There may exist non-realizable cases where the mapping between PZ and PX cannot be represented by any generator GθG with θG ΘG. In Section A.2, we use a simulation study to show that both the attractive MMD loss Latt D (Eq. S1b) and the proposed repulsive loss Lrep D (Eq. S1c) may be locally stable and leave the proof for future work. A.2 SIMULATION STUDY In this section, we reused the example from Nagarajan & Kolter (2017) to show that GAN trained using the MMD loss in Eq. S1 is locally stable. Consider a two-parameter MMD-GAN with uniform latent distribution PZ over [ 1, 1], generator G(z) = w1z, discriminator D(x) = w2x2, and Gaussian kernel krbf 0.5. The MMD-rbf model (Lmmd G and Latt D from Eq. S1b) and the MMD-rep model (Lmmd G and Lrep D from Eq. S1c) were tested. Each model was applied to two cases: (a) the data distribution PX is the same as PZ, i.e., uniform over [ 1, 1], thus PX is realizable; Published as a conference paper at ICLR 2019 (a) MMD-rbf, PX = U( 1, 1) (b) MMD-rep, PX = U( 1, 1) (c) MMD-rbf, PX = N(0, 1) (d) MMD-rep, PX = N(0, 1) Figure S1: Streamline plots of MMD-GAN using the MMD-rbf and the MMD-rep model on distributions: PZ = U( 1, 1), PX = U( 1, 1) or PX = N(0, 1). In (a) and (b), the equilibria satisfying PG = PX lie on the line w1 = 1. In (c), the equilibrium lies around point (1.55, 0.74); in (d), it is around (1.55, 0.32). (b) PX is standard Gaussian, thus non-realizable for any w1 R. Fig. 
S1 shows that MMD-GAN are locally stable in both cases and Dθ D(x) = 0 does not hold in general for MMD loss. However, MMD-rep may not be globally stable for the tested cases: initialization of (w1, w2) in some regions may lead to the trivial solution w2 = 0 (see Fig. S1b and S1d). We note that by decreasing the learning rate for G, the area of such regions decreased. At last, it is interesting to note that both MMD-rbf and MMD-rep had the same nontrivial solution w1 1.55 for generator in the non-realizable cases (see Fig. S1c and S1d). A.3 PROOF OF PROPOSITION 1 This section divides the proof for Proposition 1 into two parts. First, we show that GAN with the MMD loss in Eq. S1 has equilibria for any parameter configuration of discriminator D; second, we prove the model is locally exponentially stable. For convenience, we consider the general form of discriminator loss in Eq. 5: LD,λ = λEPX[k D(x, x )] (λ 1)EPX,PG[k D(x, y)] EPG[k D(y, y )] (S2) which has Latt D and Lrep D as the special cases when λ equals 1 and 1 respectively. Consider real data Xr PX, latent variable Z PZ and generated variable Yg = GθG(Z). Let xr, z, yg be their samples. Denote a b = a b ; . θD = LD θD , . θG = LG θG ; dg = D(G(z)), dr = D(xr) where LD and LG are the losses for D and G respectively. Assume an isotropic stationary kernel k(a, b) = k I( a b ) (Genton (2002)) is used in MMD. We first show: Proposition 1 (Part 1). If there exists θ G ΘG such that Pθ G = PX, the GAN with the MMD loss in Eq. S1a and Eq. S2 has equilibria (θ G, θD) for any θD ΘD. Published as a conference paper at ICLR 2019 Proof. Denote ei,j = ai bj and k ei,j = k(ai,bj) e where k is the kernel of MMD. The gradients of MMD loss are . θD = (λ 1)EPX,PθG [ k er,g er,g θD ] λEPX[ k er1,r2 er1,r2 θD ] + EPθG [ k eg1,g2 eg1,g2 θD ] (S3a) . θG = 2EPθG,PX[ k eg,r dg xg xg θG] EPθG [ k eg1,g2( dg1 xg1 xg1 θG dg2 xg2 xg2 θG )] (S3b) Note that, given i.i.d. drawn samples X = {xi}n i=1 PX and Y = {yi}n i=1 PG, an unbiased estimator of the squared MMD is (Gretton et al. (2012)) ˆ M 2 k(PX, PG) = 1 n(n 1) i =j k(xi, xj)+ 1 n(n 1) i =j k(yi, yj) 2 n(n 1) i =j k(xi, yj) (S4) At equilibrium, consider a sequence of N samples dri = dgi = di with N , we have . θD (λ 1) X i =j k ei,j ei,j θD λ X i =j k ei,j ei,j θD + X i =j k ei,j ei,j θD = 0 i =j k ei,j( di xi xi θG dj xj xj θG) + 2 X i =j k ei,j di xi xi θG i =j k ei,j( di xi xi θG + dj xj xj θG) = 0 where for . θ G we have used the fact that for each term in the summation, there exists an term with i, j reversed and k ei,j = k ej,i thus the summation is zero. Since we have not assumed the status θD = 0 for any θD ΘD. We proceed to prove the model stability. First, following Theorem 5 in Gretton et al. (2012) and Theorem 4 in Li et al. (2017a), it is straightforward to see: Lemma A.1. Under Assumption 2, M 2 k DθD (PX, PθG) 0 with the equality if and only if PX = PθG. Lemma A.1 and Proposition 1 (Part 1) state that at equilibrium Pθ G = PX, every discriminator DθD and kernel k will give M 2 k DθD (Pθ G, PX) = 0, thus no discriminator can distinguish the two distributions. On the other hand, we cite Theorem A.4 from Nagarajan & Kolter (2017): Lemma A.2 (Nagarajan & Kolter (2017)). Consider a non-linear system of parameters (θ, γ): θ = h1(θ, γ), γ = h2(θ, γ) with an equilibrium point at (0, 0). Let there exist ϵ such that γ Bϵ(0), (0, γ) is an equilibrium. If J = h1(θ,γ) θ (0,0) is a Hurwitz matrix, the non-linear system is exponentially stable. Now we can prove: Proposition 1 (Part 2). 
At equilibrium Pθ G = PX, the GAN trained using MMD loss and gradient descent methods is locally exponentially stable at (θ G, θD) for any θD ΘD. Proof. Inspired by Nagarajan & Kolter (2017), we first derive the Jacobian of the system J = JDD JDG JGD JGG Published as a conference paper at ICLR 2019 Denote a b = 2a b2 and a bc = 2a b c. Based on Eq. S3, we have JDD =(λ 1)EPX,PθG [( er,g θD )T ( k er,g)T + ( er,g θD )T k er,g er,g θD ] (S5a) λEPX[( er1,r2 θD )T ( k er1,r2)T + ( er1,r2 θD )T k er1,r2 er1,r2 θD ] + EPθG [( eg1,g2 θD )T ( k eg1,g2)T + ( eg1,g2 θD )T k eg1,g2 eg1,g2 θD ] JDG =(λ 1)EPX,PθG[( er,g θDθG)T ( k er,g)T ( er,g θD )T k er,g dg θG] + EPθG [( eg1,g2 θDθG)T ( k eg1,g2)T + ( eg1,g2 θD )T k eg1,g2 eg1,g2 θG ] (S5b) JGD =2EPX,PθG [( eg,r θGθD)T ( k eg,r)T + ( eg,r θG )T k eg,r dg θD] EPθG [( eg1,g2 θGθD )T ( k eg1,g2)T + ( eg1,g2 θG )T k eg1,g2 eg1,g2 θD ] (S5c) JGG = EPθG [( eg1,g2 θG )T ( k eg1,g2)T + ( eg1,g2 θG )T k eg1,g2 eg1,g2 θG ] + 2EPX,PθG [( dg θG)T ( k er,g)T + ( dg θG)T k er,g dg θG] (S5d) where is the kronecker product. At equilibrium, consider a sequence of N samples dri = dgi = di with N , we have JDD = 0, JGD = 0 and JDG (λ + 1) X i 17. Based on Assumption 3, we have the following proposition: Proposition 2. If x S, D(x) = c, where c is constant, then there always exists distortion δx such that x + δx S and D(x + δx) = c. 6This include many commonly used activations like linear, sigmoid, tanh, Re LU and ELU. 7For distributions with semi-infinite or infinite support, we consider the effective or truncated support Sϵ(P) = {x X|P(x) ϵ}, where ϵ > 0 is a small scalar. This is practical, e.g., univariate Gaussian has support in ( , + ) yet a sample five standard deviations away from the mean is unlikely to be valid. Published as a conference paper at ICLR 2019 Proof. Without loss of generality, we consider D(x) = W2h(x) + b2 and h(x) = f(W1x + b1), where W1 Rdh d, W2, b1, b2 are model weights and biases, f is an activation function satisfying Assumption 3. For x S, since D(x) = c, we have h(x) null(W2). Furthermore: (a) If rank(W1) < d, for any δx null(W1), h(x + δx) null(W2). (b) If rank(W1) = d = dh, the problem h(x + δx) = k h(x) has unique solution for any k R as long as k h(x) is within the output range of f. (c) If rank(W1) = d < dh, let U and V be two basis matrices of Rdh such that W1x = U ˆx T 0T T and any vector in null(W2) can be represented as V z T 0T T , where ˆx Rdh d, z Rdh n and n is the nullity of W2. Let the projected support be ˆS. Thus, ˆx ˆS, there exists z such that f(U ˆx T 0T T + b1) = V z T z T c T with zc = 0. Consider the Jacobian: J = z T z T c T ˆx T 0T T = V 1 ΣU (S6) where Σ = diag( d f d ai ) and a = [ai]dh i=1 is the input to activation, or pre-activations. Since ˆS is continuous and compact, it has infinite number of boundary points {ˆxb} for d > 1. Consider one boundary point ˆxb and its normal line δˆxb. Let ϵ > 0 be a small scalar such that ˆxb ϵδˆxb ˆS and ˆxb + ϵδˆxb ˆS. For linear activation, Σ = I and J is constant. Then zc remains 0 for ˆxb + ϵδˆxb, i.e., there exists z such that h(ˆx + ϵδˆx) null(W2). For nonlinear activations, assume f has N discontinuities. Since U ˆx T 0T T + b1 = c has unique solution for any vector c, the boundary points {ˆxb} cannot yield pre-activations {ab} that all lie on the discontinuities in any of the dh directions. Though we might need to sample d N+1 h points in the worst case to find an exception, there are infinite number of exceptions. 
Let ˆxb be a sample where {ab} does not lie on the discontinuities in any direction. Because f is continuous, zc remains 0 for ˆxb + ϵδˆxb, i.e., there exists z such that h(ˆx + ϵδˆx) null(W2). In conclusion, we can always find δx such that x + δx / S and D(x + δx) = c. Proposition 2 indicates that if Dθ D(x) = 0 for x S(PX) S S(Pθ G), Dθ D cannot discriminate against fake samples with distortions to the original data. In contrast, Assumption 2 and Lemma A.1 guarantee that, at equilibrium, the discriminator trained using MMD loss function is effective against such fake samples given a large number of i.i.d. test samples (Gretton et al. (2012)). B SUPPLEMENTARY METHODOLOGY B.1 REPRESENTATIVE LOSS FUNCTIONS IN LITERATURE Several loss functions have been proposed to quantify the difference between real and generated sample scores, including: (assume linear activation is used at the last layer of D) The Minimax loss (Goodfellow et al. (2014)): LD = EPX[Softplus( D(x))] + EPZ[Softplus(D(G(z)))] and LG = LD, which can be derived from the Jensen Shannon (JS) divergence between PX and the model distribution PG. The non-saturating loss (Goodfellow et al. (2014)), which is a variant of the minimax loss with the same LD and LG = EPZ[Softplus( D(G(z)))]. The Hinge loss (Tran et al. (2017)): LD = EPX[Re LU(1 D(x))]+EPZ[Re LU(1+D(G(z)))], LG = EPZ[ D(G(z))], which is notably known for usage in support vector machines and is related to the total variation (TV) distance (Nguyen et al. (2009)). Published as a conference paper at ICLR 2019 The Wasserstein loss (Arjovsky et al. (2017); Gulrajani et al. (2017)), which is derived from the Wasserstein distance between PX and PG: LG = EPZ[D(G(z))], LD = EPZ[D(G(z))] EPX[D(x)], where D is subject to some Lipschitz constraint. The maximum mean discrepancy (MMD) (Li et al. (2017a); Bi nkowski et al. (2018)), as described in Section 2. B.2 NETWORK ARCHITECTURE For unsupervised image generation tasks on CIFAR-10 and STL-10 datasets, the DCGAN architecture from Miyato et al. (2018) was used. For Celeb A and LSUN bedroom datasets, we added more layers to the generator and discriminator accordingly. See Table S1 and S2 for details. Table S1: DCGAN models for image generation on CIFAR-10 (h = w = 4, H = W = 32) and STL-10 (h = w = 6, H = W = 48) datasets. For non-saturating loss and hinge loss, s = 1; for MMD-rand, MMD-rbf, MMD-rq, s = 16. (a) Generator z R128 N(0, I) 128 h w 512, dense, linear 4 4, stride 2 deconv, 256, BN, Re LU 4 4, stride 2 deconv, 128, BN, Re LU 4 4, stride 2 deconv, 64, BN, Re LU 3 3, stride 1 conv, 3, Tanh (b) Discriminator RGB image x [ 1, 1]H W 3 3 3, stride 1 conv, 64, LRe LU 4 4, stride 2 conv, 128, LRe LU 3 3, stride 1 conv, 128, LRe LU 4 4, stride 2 conv, 256, LRe LU 3 3, stride 1 conv, 256, LRe LU 4 4, stride 2 conv, 512, LRe LU 3 3, stride 1 conv, 512, LRe LU h w 512 s, dense, linear Table S2: DCGAN models for image generation on Celeb A and LSUN-bedroom datasets. For non-saturating loss and hinge loss, s = 1; for MMD-rand, MMD-rbf, MMD-rq, s = 16. 
(a) Generator z R128 N(0, I) 128 4 4 1024, dense, linear 4 4, stride 2 deconv, 512, BN, Re LU 4 4, stride 2 deconv, 256, BN, Re LU 4 4, stride 2 deconv, 128, BN, Re LU 4 4, stride 2 deconv, 64, BN, Re LU 3 3, stride 1 conv, 3, Tanh (b) Discriminator RGB image x [ 1, 1]64 64 3 3 3, stride 1 conv, 64, LRe LU 4 4, stride 2 conv, 128, LRe LU 3 3, stride 1 conv, 128, LRe LU 4 4, stride 2 conv, 256, LRe LU 3 3, stride 1 conv, 256, LRe LU 4 4, stride 2 conv, 512, LRe LU 3 3, stride 1 conv, 512, LRe LU 4 4, stride 2 conv, 1024, LRe LU 3 3, stride 1 conv, 1024, LRe LU 4 4 512 s, dense, linear Published as a conference paper at ICLR 2019 C POWER ITERATION FOR CONVOLUTION OPERATION This section introduces the power iteration for convolution operation (PICO) method to estimate the spectral norm of a convolution kernel, and compare PICO with the power iteration for matrix (PIM) method used in Miyato et al. (2018). C.1 METHOD FORMATION For a weight matrix W , the spectral norm is defined as σ(W ) = max v 2 1 W v 2. The PIM is used to estimate σ(W ) (Miyato et al. (2018)), which iterates between two steps: 1. Update u = W v/ W v 2; 2. Update v = W T u/ W T u 2. The convolutional kernel Wc is a tensor of shape h w cin cout with h, w the receptive field size and cin, cout the number of input/output channels. To estimate σ(Wc), Miyato et al. (2018) reshaped it into a matrix Wrs of shape (hwcin) cout and estimated σ(Wrs). We propose a simple method to calculate Wc directly based on the fact that convolution operation is linear. For any linear map T : Rm Rn, there exists matrix WL Rn m such that y = T(x) can be represented as y = WLx. Thus, we may simply substitute WL = y x in the PIM method to estimate the spectral norm of any linear operation. In the case of convolution operation , there exist doubly block circulant matrix Wdbc such that u = Wc v = Wdbcv. Consider v = W T dbcu = [ u v ]T u which is essentially the transpose convolution of Wc on u (Dumoulin & Visin (2016)). Thus, similar to PIM, PICO iterates between the following two steps: 1. Update u = Wc v/ Wc v 2; 2. Do transpose convolution of Wc on u to get ˆv; update v = ˆv/ ˆv 2. Similar approaches have been proposed in Tsuzuku et al. (2018) and Virmaux & Scaman (2018) from different angles, which we were not aware during this study. In addition, Sedghi et al. (2019) proposes to compute the exact singular values of convolution kernels using FFT and SVD. In spectral normalization, only the first singular value is concerned, making the power iteration methods PIM and PICO more efficient than FFT and thus preferred in our study. However, we believe the exact method FFT+SVD (Sedghi et al. (2019)) may eventually inspire more rigorous regularization methods for GAN. The proposed PICO method estimates the real spectral norm of a convolution kernel at each layer, thus enforces an upper bound on the Lipschitz constant of the discriminator D. Denote the upper bound as LIPPICO. In this study, Leaky Re LU (LRe LU) was used at each layer of D, thus LIPPICO 1 (Virmaux & Scaman (2018)). In practice, however, PICO would often cause the norm of the signal passing through D to decrease to zero, because at each layer, the signal hardly coincides with the first singular-vector of the convolution kernel; and the activation function LRe LU often reduces the norm of the signal. Consequently, the discriminator outputs tend to be similar for all the inputs. 
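As a concrete reference for the two PICO steps above, the following is a minimal PyTorch sketch we add (assuming stride-1 convolutions with symmetric padding, so that F.conv_transpose2d is the exact adjoint of F.conv2d; strided kernels additionally require output_padding handling, which is omitted here):

```python
import torch
import torch.nn.functional as F

def pico_step(weight, v, padding=1, n_iter=1, eps=1e-12):
    # weight: (c_out, c_in, h, w) convolution kernel;
    # v: (1, c_in, H, W) running estimate of the leading right singular vector
    #    of the linear map defined by the convolution.
    for _ in range(n_iter):
        u = F.conv2d(v, weight, padding=padding)             # step 1: u = W_c * v, normalized
        u = u / (u.norm() + eps)
        v = F.conv_transpose2d(u, weight, padding=padding)   # step 2: v_hat = W_c^T * u, normalized
        v = v / (v.norm() + eps)
    # spectral-norm estimate: sigma ~= <u, W_c * v>
    sigma = (u * F.conv2d(v, weight, padding=padding)).sum()
    return sigma, v

# The kernel is then divided by sigma, as in standard spectral normalization;
# the experiments additionally rescale the normalized signal by a per-layer
# constant C, discussed next.
```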
To compensate the loss of norm at each layer, the signal is multiplied by a constant C after each spectral normalization. This essentially enlarges LIPPICO by CK where K is the number of layers in the DCGAN discriminator. For all experiments in Section 5, we fixed C = 1 0.55 1.82 as all loss functions performed relatively well empirically. In Appendix Section C.3, we tested the effects of coefficient CK on the performance of several loss functions. C.2 COMPARISON TO PIM PIM (Miyato et al. (2018)) also enforces an upper bound LIPPIM on the Lipschitz constant of the discriminator D. Consider a convolution kernel Wc with receptive field size h w and stride s. Let σPICO and σPIM be the spectral norm estimated by PICO and PIM respectively. We empirically Published as a conference paper at ICLR 2019 Table S3: Fr echet Inception distance (FID) on image generation tasks using spectral normalization with two power iteration methods PICO and PIM Methods CK in PICO CIFAR-10 STL-10 Celeb A LSUN-bedrom PIM PICO PIM PICO PIM PICO PIM PICO Hinge 128 23.60 22.89 47.10 47.24 10.02 9.08 27.38 17.20 MMD-rbf 128 26.56 26.50 53.17 54.23 13.06 12.81 MMD-rep 64 19.98 17.00 40.40 37.15 8.51 6.81 74.03 16.01 MMD-rep-b 64 18.24 16.65 39.78 37.31 7.09 6.42 20.12 11.22 found8 that σ 1 PIM varies in the range [σ 1 PICO, hw s σ 1 PICO], depending on the kernel Wc. For a typical kernel of size 3 3 and stride 1, σ 1 PIM may vary from σ 1 PICO to 3σ 1 PICO. Thus, LIPPIM is indefinite and may vary during training. In deep convolutional networks, PIM could potentially result in a very loose constraint on the Lipschitz constant of the network. In Appendix Section C.3, we experimentally compare the performance of PICO and PIM with several loss functions. C.3 EXPERIMENTS In this section, we empirically evaluate the effects of coefficient CK on the performance of PICO and compare PICO against PIM using several loss functions. Experiment setup: We used a similar setup as Section 5.1 with the following adjustments. Four loss functions were tested: hinge, MMD-rbf, MMD-rep and MMD-rep-b. Either PICO or PIM was used at each layer of the discriminator. For PICO, five coefficients CK were tested: 16, 32, 64, 128 and 256 (note this is the overall coefficient for K layers; K = 8 for CIFAR-10 and STL-10; K = 10 for Celeb A and LSUN-bedroom; see Appendix B.2). FID was used to evaluate the performance of each combination of loss function and power iteration method, e.g., hinge + PICO with CK = 16. Results: For each combination of loss function and power iteration method, the distribution of FID scores over 16 learning rate combinations is shown in Fig. S2. We separated well-performed learning rate combinations from diverged or poorly-performed ones using a threshold τ as the diverged cases often had non-meaningful FID scores. The boxplot shows the distribution of FID scores for goodperformed cases while the number of diverged or poorly-performed cases was shown above each box if it is non-zero. Fig. S2 shows that: 1) When PICO was used, the hinge, MMD-rbf and MMD-rep methods were sensitive to the choices of CK while MMD-rep-b was robust. For hinge and MMD-rbf, higher CK may result in better FID scores and less diverged cases over 16 learning rate combinations. For MMD-rep, higher CK may cause more diverged cases; however, the best FID scores were often achieved with CK = 64 or 128. 2) For CIFAR-10, STL-10 and Celeb A datasets, PIM performed comparable to PICO with CK = 128 or 256 on four loss functions. 
For the LSUN-bedroom dataset, it is likely that the performance of PIM corresponded to that of PICO with $C^K > 256$. This implies that PIM may result in a relatively loose Lipschitz constraint on deep convolutional networks. 3) MMD-rep-b performed generally better than hinge and MMD-rbf under the tested power iteration methods and hyper-parameter configurations. Using PICO, MMD-rep also achieved generally better FID scores than hinge and MMD-rbf. This implies that, given a limited computational budget, the proposed repulsive loss may be a better choice than the hinge and MMD losses for the discriminator.

⁸ This was obtained by optimizing $\sigma_{\mathrm{PICO}}/\sigma_{\mathrm{PIM}}$ with respect to a variety of randomly initialized kernels $W_c$. Both gradient descent and Adam were tested with a small learning rate of $10^{-5}$, so that the error of the spectral-norm estimate at each iteration can be ignored.

Table S3 shows the best FID scores obtained by PICO and PIM, where $C^K$ was fixed at 128 for hinge and MMD-rbf, and at 64 for MMD-rep and MMD-rep-b. For hinge and MMD-rbf, PICO performed significantly better than PIM on the LSUN-bedroom dataset and comparably on the remaining datasets. For MMD-rep and MMD-rep-b, PICO achieved consistently better FID scores than PIM. However, compared to PIM, PICO has a higher computational cost, roughly equal to the additional cost incurred by increasing the batch size by two (Tsuzuku et al. (2018)). This may be problematic when a small batch size has to be used due to memory constraints, e.g., when handling high-resolution images on a single GPU. Thus, we recommend using PICO when the computational cost is less of a concern.

Table S4: Inception score (IS) and Fréchet Inception distance (FID) on the CIFAR-10 dataset using gradient penalty and different loss functions

Methods¹        IS      FID
Real data       11.31   2.09
SMMDGAN²        7.0     31.5
SN-SMMDGAN²     7.3     25.0
MMD-rep-gp      7.26    23.01

¹ All methods used the same DCGAN architecture.
² Results from Arbel et al. (2018), Table 1.

Table S5: Inception score (IS) and Fréchet Inception distance (FID) on the CIFAR-10 dataset using MMD-rep and different dimensions of discriminator outputs

Methods         $C^K$   IS      FID
Real data       –       11.31   2.09
MMD-rep-1       64      7.43    22.43
MMD-rep-4       64      7.81    17.87
MMD-rep-16      32      8.20    16.99
MMD-rep-64      32      8.08    15.65
MMD-rep-256     32      7.96    16.61

D SUPPLEMENTARY EXPERIMENTS

D.1 LIPSCHITZ CONSTRAINT VIA GRADIENT PENALTY

Gradient penalty has been widely used to impose the Lipschitz constraint on the discriminator, arguably since Wasserstein GAN (Gulrajani et al. (2017)). This section explores whether the proposed repulsive loss can be combined with a gradient penalty. Several gradient penalty methods have been proposed for MMD-GAN. Bińkowski et al. (2018) penalized the gradient norm of the witness function $f_w(z) = \mathbb{E}_{P_X}[k_D(z, x)] - \mathbb{E}_{P_G}[k_D(z, y)]$ with respect to the interpolated sample $z = ux + (1 - u)y$ towards one, where $u \sim U(0, 1)$⁹. More recently, Arbel et al. (2018) proposed to impose the Lipschitz constraint on the mapping $\phi \circ D$ directly and derived the Scaled MMD (SMMD) as $SM_k(P, Q) = \sigma_{\mu,k,\lambda} M_k(P, Q)$, where the scale $\sigma_{\mu,k,\lambda}$ incorporates gradient and smoothness penalties. Using the Gaussian kernel and the measure $\mu = P_X$ leads to the discriminator loss:

$$L_D^{\mathrm{SMMD}} = \frac{L_D^{\mathrm{att}}}{1 + \lambda\, \mathbb{E}_{P_X}[\|\nabla D(x)\|_F^2]} \qquad \text{(S7)}$$

We apply the same form of gradient penalty to the repulsive loss:

$$L_D^{\mathrm{rep\text{-}gp}} = \frac{L_D^{\mathrm{rep}} - 1}{1 + \lambda\, \mathbb{E}_{P_X}[\|\nabla D(x)\|_F^2]} \qquad \text{(S8)}$$

where the numerator satisfies $L_D^{\mathrm{rep}} - 1 \le 0$, so the discriminator always attempts to minimize both $L_D^{\mathrm{rep}}$ and the Frobenius norm of the gradients $\nabla D(x)$ with respect to the real samples.
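To make Eq. S8 concrete, the following PyTorch sketch (our illustration; the function and argument names are ours, not from the released code) assembles the gradient-penalized repulsive loss from an already-computed repulsive loss value and a batch of real samples, assuming a scalar discriminator output as in the experiment below.

```python
import torch

def repulsive_loss_gp(loss_rep, discriminator, x_real, lam=0.1):
    """Gradient-penalized repulsive discriminator loss of Eq. S8 (sketch).

    loss_rep:      scalar tensor holding L^rep_D computed on the current batch
    discriminator: network D with a scalar output per sample (output dim 1)
    x_real:        batch of real samples x ~ P_X
    lam:           penalty coefficient lambda (we used 0.1 on CIFAR-10)
    """
    x = x_real.detach().requires_grad_(True)
    d_x = discriminator(x)  # shape (batch, 1)
    grads = torch.autograd.grad(
        outputs=d_x.sum(), inputs=x, create_graph=True
    )[0]
    # Monte-Carlo estimate of E_{P_X}[ ||nabla_x D(x)||_F^2 ]
    grad_pen = grads.reshape(grads.size(0), -1).pow(2).sum(dim=1).mean()
    # Numerator L^rep_D - 1 <= 0, so minimizing the ratio also drives down the
    # gradient norm on real samples (in the denominator).
    return (loss_rep - 1.0) / (1.0 + lam * grad_pen)
```

Note that create_graph=True is needed so that the penalty term itself can be backpropagated through when updating the discriminator.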
Meanwhile, the generator is trained using the MMD loss $L_G^{\mathrm{mmd}}$ (Eq. 2).

Experiment setup: The gradient-penalized repulsive loss $L_D^{\mathrm{rep\text{-}gp}}$ (Eq. S8, referred to as MMD-rep-gp) was evaluated on the CIFAR-10 dataset. We found $\lambda = 10$, as used in Arbel et al. (2018), too restrictive and used $\lambda = 0.1$ instead. Following Arbel et al. (2018), the output dimension of the discriminator was set to one. Since the Lipschitz constraint was entrusted to the gradient penalty, spectral normalization was not used. The rest of the experiment setup can be found in Section 5.1.

⁹ Empirically, we found this gradient penalty did not work with the repulsive loss. The reason may be that the attractive loss $L_D^{\mathrm{att}}$ (Eq. 3) is symmetric, in the sense that swapping $P_X$ and $P_G$ yields the same loss, whereas the repulsive loss is asymmetric and naturally results in varying gradient norms in the data space.

Results: Table S4 shows that the proposed repulsive loss can be combined with a gradient penalty to achieve reasonable results on the CIFAR-10 dataset. For comparison, we cite the Inception score and FID of the Scaled MMD-GAN (SMMDGAN) and the Scaled MMD-GAN with spectral normalization (SN-SMMDGAN) from Arbel et al. (2018). Note that SMMDGAN and SN-SMMDGAN used the same DCGAN architecture as MMD-rep-gp but were trained for 150k generator updates and 750k discriminator updates, far more than MMD-rep-gp (100k updates for both G and D). Thus, the repulsive loss significantly improved over the attractive MMD loss for the discriminator.

D.2 OUTPUT DIMENSION OF DISCRIMINATOR

In this section, we investigate the impact of the discriminator output dimension on the performance of the repulsive loss.

Experiment setup: We used a similar setup to Section 5.1 with the following adjustments. The repulsive loss was tested on the CIFAR-10 dataset with a variety of discriminator output dimensions: $d \in \{1, 4, 16, 64, 256\}$. Spectral normalization was applied to the discriminator using the proposed PICO method (see Appendix C), with the coefficient $C^K$ selected from $\{16, 32, 64, 128, 256\}$.

Results: Table S5 shows that using more than one output neuron in the discriminator $D$ significantly improved the performance of the repulsive loss over the one-neuron case on the CIFAR-10 dataset. The reason may be that with too few output neurons it is harder for the discriminator to learn an injective and discriminative representation of the data (see Fig. 4b). However, the performance gain diminished as more neurons were used, perhaps because it becomes easier for $D$ to surpass the generator $G$ and trap it around saddle solutions. The computational cost also increased slightly with more output neurons.

D.3 SAMPLES OF UNSUPERVISED IMAGE GENERATION

Generated samples on the CelebA dataset are given in Fig. S3 and samples on the LSUN-bedroom dataset in Fig. S4.
Figure S2: Boxplots of the FID scores over 16 learning rate combinations on four datasets: (a) CIFAR-10, (b) STL-10, (c) CelebA, (d) LSUN-bedroom, using four loss functions: hinge, MMD-rbf, MMD-rep and MMD-rep-b. Spectral normalization was applied to the discriminator with two power iteration methods, PICO and PIM. For PICO, five coefficients $C^K$ were tested: 16, 32, 64, 128 and 256. A learning rate combination was considered diverged or poorly performing if its FID score exceeded a threshold $\tau$, set to 50, 80, 50 and 90 for CIFAR-10, STL-10, CelebA and LSUN-bedroom, respectively. The box quartiles were computed from the cases with FID < $\tau$, while the number of diverged or poorly performing cases (out of 16 learning rate combinations) is shown above each box whenever it is non-zero. We introduced $\tau$ because the diverged cases often had arbitrarily large and non-meaningful FID scores.

Figure S3: Image generation using different loss functions on the 64 × 64 CelebA dataset. Panels: (a) real samples, (c) MMD-rbf, (d) MMD-rbf-b, (e) MMD-rep, (f) MMD-rep-b.

Figure S4: Image generation using different loss functions on the 64 × 64 LSUN-bedroom dataset. Panels: (a) real samples, (b) non-saturating, (d) MMD-rbf-b, (e) MMD-rep, (f) MMD-rep-b.