# distributional_concavity_regularization_for_gans__152f7470.pdf

Published as a conference paper at ICLR 2019

DISTRIBUTIONAL CONCAVITY REGULARIZATION FOR GANS

Shoichiro Yamaguchi, Masanori Koyama Preferred Networks {guguchi, masomatics}@preferred.jp

We propose Distributional Concavity (DC) regularization for Generative Adversarial Networks (GANs), a functional gradient-based method that promotes the entropy of the generator distribution and works against mode collapse. Our DC regularization is an easy-to-implement method that can be used in combination with current state-of-the-art methods such as Spectral Normalization and Wasserstein GAN with gradient penalty to further improve performance. We will not only show that our DC regularization can achieve highly competitive results on the ILSVRC2012 and CIFAR datasets in terms of Inception score and Fréchet inception distance, but also provide a mathematical guarantee that our method can always increase the entropy of the generator distribution. We will also show an intimate theoretical connection between our method and the theory of optimal transport.

1 INTRODUCTION

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are models consisting of two adversarial neural networks, designed to train a generator distribution that mimics a target distribution defined on an (often) high-dimensional space. They have been successful in numerous applications, including image and movie generation (Isola et al., 2017; Zhu et al., 2017; Saito et al., 2017). However, GANs have not yet completely established themselves as a widely used scientific tool because of the sheer computational difficulty of their training process. As such, there have been numerous studies seeking a way to stabilize the training process of GANs (Gulrajani et al., 2017; Miyato et al., 2018; Karras et al., 2018).

Mode collapse is a persistent, central problem in the training of GANs; the term collectively refers to the lack of diversity in the generator distribution. For instance, without any countermeasure, GANs applied to a multimodal mixture of Gaussians will often train a generator distribution with one mode (see Fig 22 of Goodfellow (2016) for an example). Mode collapse has been discussed in numerous studies related to GANs. To name a few, see Goodfellow (2016); Metz et al. (2016); Arjovsky & Bottou (2017); Arjovsky et al. (2017); Lin et al. (2017).

In the language of statistical machine learning, mode collapse can be described as a case of entropy degeneration. A naive countermeasure against mode collapse is therefore to augment the entropy of the generator distribution. Not many studies to date, however, have tackled the problem of mode collapse using this direct approach. Dai et al. (2017) partially realized this idea by actually evaluating the entropy of the generator distribution with variational inference and including it in the objective function. This type of strategy requires the user to continuously produce reliable estimates of the generator's entropy based on finite samples throughout the course of the training. As such, not much further improvement can be expected from this strategy, because precise estimation of the entropy tends to be computationally heavy, and empirical estimation is an excruciatingly difficult problem on its own in high dimensions, as is the case in image and movie generation. In fact, the performance of classical methods based on kernel density estimation, for example, does not scale well with dimension, and the number of samples required to control the MSE can grow exponentially with the dimension (Cacoullos, 1966; Ozakin & Gray, 2009).

In this study, we use the theory of functional gradient to develop a method that can promote the entropy of the generator distribution without directly estimating the entropy itself. All in all, the objective of GANs is to find a good generator. The motivation of the functional gradient method is to pay attention to the update of the generator in the space of generator functions itself, instead of the updates in the parameter space. In the function space, the functional properties of the generators are generally easier to express than in the parameter space. As such, the search for a generator with good functional properties may be much easier in the space of functions than in the space of parameters. Throughout, this philosophy will serve as the basis of our method.

In more general technical terms, the functional gradient of an objective function with respect to a model function is an infinite-dimensional gradient computed over the infinite-dimensional space of all models. In our case, our interest is the functional gradient of the GANs objective function with respect to the generator function. As we will show, all variations of GANs to date implicitly use the discriminator function to compute the functional derivative of the objective function with respect to the generator distribution function. In other words, the discriminator determines the direction in which the algorithm should move the mass of the current generator distribution; the discriminator determines what the next (target) distribution looks like. The work of Nitanda & Suzuki (2018) is a pioneering study that explicitly incorporated this idea into the training of neural generator models. Their study showed that one can carry out a faithful functional gradient-based update by inserting what they call a gradient layer into the layers of the neural generator function. Johnson & Zhang (2018) further polished this strategy by periodically distilling the networks.

A more precisely worded advantage of the functional gradient method is that, at every stage in the training process of the generator distribution, it allows the user to monitor what the next distribution (target distribution) looks like in terms of the current distribution. The ability of the functional gradient update to tell something about the next (target) distribution is a significant advantage of the functional gradient method over the conventional parametric methods, because the target distribution set forth by conventional parametric methods is usually expressed in a complicated parametric form (e.g., Deep Neural Nets (DNNs)), and its behavior is often difficult to predict. This advantage of the functional gradient-based update suggests the possibility that we can control the update rule in order to deliberately direct the generator distribution to a distribution with preferred properties which, in our case, include high entropy.

We discovered that, by locally concavifying the discriminator function, we can manipulate the functional gradient so that the next update for the generator will always target a distribution with higher entropy. From now on, we refer to our method as Distributional Concavity (DC) regularization. Our method does not require the direct estimation of the entropy. We will not only show that our DC regularization can help improve the performance of GANs in terms of Inception score (Salimans et al., 2016) and Fréchet inception distance (FID) (Heusel et al., 2017), but also give a mathematical guarantee that our regularization will always increase the entropy of the generator distribution. We will also show that our method greatly outperforms the method of Dai et al. (2017), which directly estimates the entropy based on classical techniques.

The regularization strategy of monotonically increasing the entropy of the generator distribution over the course of its training is not without theoretical basis. Our DC regularization is closely related to the theory of optimal transport. We will show that, when the entropy of the true distribution is higher than that of the current generator distribution, the functional gradient that is properly derived from the optimal mass transport always increases the entropy of the distribution. Moreover, the functional update used in our method is always a monotonic mapping, which also turns out to be one of the properties satisfied by the update with optimal transport. This is a preferred property as well, because it tends to promote the smooth training of the generator. We summarize our contributions below:

- We propose an update for the generator of GANs that promotes the entropy of the generator without the need for an explicit estimation of the actual entropy.
- We show that, when the entropy of the true distribution is larger than that of the current generator distribution, the functional gradient derived from the 2-Wasserstein ($W_2$) optimal transport always increases the entropy of the distribution.
- We provide a mathematical guarantee that our method increases the entropy of the generator distribution at every step.
- We show that our method improves the results of GANs in terms of Inception score and FID score.


2.1 GENERATIVE ADVERSARIAL NETWORKS

Let us first review the formulation of the original GANs. Unless otherwise noted, let us use boldface capital letters to refer to a random variable, and lower case letters for its realization. The training process of GANs (Goodfellow et al., 2014) is a two-player min-max game in which the generator distribution $\mu_\theta$ is trained to minimize the divergence between the true distribution $\nu$ and the generator distribution $\mu_\theta$ measured by the current critic $F$, while the critic $F$ is trained to most strongly discriminate $\mu_\theta$ from $\nu$. Often, $\mu_\theta$ is produced by applying a parametric generator function $G_\theta$ to a random seed variable with some distribution $\mu_z$, and it satisfies

$$\mathbb{E}_{\mu_\theta}[h(X)] = \mathbb{E}_{\mu_z}[h(G_\theta(Z))] \tag{1}$$

for all measurable statistics $h$. In mathematical language, $\mu_\theta$ is a pushforward of $\mu_z$, and is often written as $G_\theta\#\mu_z$. In a nutshell, GANs aim to find an optimal parametric distribution $\mu_\theta$ that achieves the minimum value for

$$\min_{\mu_\theta} \max_{F}\; \mathbb{E}_{\nu}[L_r(F(X))] + \mathbb{E}_{\mu_\theta}[L_g(F(X))] := \min_{\mu_\theta} \max_{F}\; V(\mu_\theta, F) \tag{2}$$

where $L_r$ and $L_g$ are functions of the user's choice. We can retrieve the formulation of the original GANs (Goodfellow et al., 2014) by letting $L_r(F(x)) = -\mathrm{softplus}(-F(x))$ and $L_g(F(x)) = -\mathrm{softplus}(F(x))$, where $\mathrm{softplus}(\cdot) = \log(1 + \exp(\cdot))$. There is a variety of choices for $(L_r, L_g)$ (Arjovsky et al., 2017; Lim & Ye, 2017; Mao et al., 2017; Nowozin et al., 2016). Rewritten as an optimization problem in the function $G$, our objective function is given by

$$\min_{G} \max_{F}\; \mathbb{E}_{\nu}[L_r(F(X))] + \mathbb{E}_{G\#\mu_z}[L_g(F(X))] := \min_{G} \max_{F}\; V(G, F) \tag{3}$$
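As a rough illustration of the objective in Eq. (3), the following sketch estimates $V(G, F)$ by Monte Carlo for a one-dimensional toy problem. It assumes the sign convention $L_r(F(x)) = -\mathrm{softplus}(-F(x))$ and $L_g(F(x)) = -\mathrm{softplus}(F(x))$, which matches the original GAN objective; the generator `toy_generator`, the two critics, and the Gaussian data are hypothetical choices made only for this example.

```python
import math
import random


def softplus(t):
    # Numerically stable log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def loss_real(f):
    # Assumed sign convention: L_r(F(x)) = -softplus(-F(x)) = log sigmoid(F(x)).
    return -softplus(-f)


def loss_gen(f):
    # Assumed sign convention: L_g(F(x)) = -softplus(F(x)) = log(1 - sigmoid(F(x))).
    return -softplus(f)


def objective(critic, generator, real_samples, seeds):
    # Monte Carlo estimate of V(G, F) in Eq. (3).
    real_term = sum(loss_real(critic(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(loss_gen(critic(generator(z))) for z in seeds) / len(seeds)
    return real_term + fake_term


rng = random.Random(0)
real_samples = [rng.gauss(2.0, 1.0) for _ in range(2000)]   # toy "true" data
seeds = [rng.gauss(0.0, 1.0) for _ in range(2000)]          # samples from mu_z


def toy_generator(z):
    # Hypothetical G_theta: a fixed linear map of the seed.
    return 0.5 * z


# A critic that outputs 0 everywhere cannot tell the two distributions apart.
v_blind = objective(lambda x: 0.0, toy_generator, real_samples, seeds)
# A critic that is positive near the data and negative near the generator does better.
v_informed = objective(lambda x: 2.0 * (x - 1.0), toy_generator, real_samples, seeds)
print(v_blind, v_informed)
```

Under these assumptions the uninformative critic yields $-2\log 2$, and a critic that separates the two sample sets attains a larger value, which is the quantity the max over $F$ pushes up.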

2.2 FUNCTIONAL GRADIENT INTERPRETATION OF THE GANS UPDATE

Let us elaborate further on the mechanism of the functional gradient-based update and its connection to the conventional update of GANs. For ease of notation, let us write $L = L_g \circ F$ in equation (3), where $\circ$ designates function composition. Recall that the objective of the min part of the min-max game in each step is to find a good update of $\mu_\theta$ that decreases $\mathbb{E}_{\mu_\theta}[L(X)]$. It is no exaggeration to say that the choice of the next target distribution of $\mu_\theta$, that is, the distribution to which $\mu_\theta$ will be updated, completely determines the training efficiency of GANs. The most canonical and obvious approach is to use the information of the discriminator $L$ in making this decision. Note that, by a standard argument based on Taylor expansion, $L(x) \geq L(x - \alpha \nabla L(x))$ for any value of $x$ when $\alpha$ is sufficiently small. Therefore, leaving the mathematical technicalities aside, one may expect that a carefully chosen $\alpha$ can ensure

$$\mathbb{E}_{\mu_\theta}[L(X - \alpha \nabla L(X))] \leq \mathbb{E}_{\mu_\theta}[L(X)]. \tag{4}$$

One naive suggestion based on this intuition is to update $\mu_\theta$ to a distribution that can be constructed by taking a sample from $\mu_\theta$ (say, $x$) and transporting it in the direction of $-\alpha \nabla L(x)$. Using the pushforward notation, this amounts to the update from $\mu_\theta$ to $(\mathrm{Id} - \alpha \nabla L)\#\mu_\theta$. The presumed relation (4) is equivalent to

$$\mathbb{E}_{\mu_z}[L(G_\theta(Z) - \alpha \nabla L(G_\theta(Z)))] \leq \mathbb{E}_{\mu_z}[L(G_\theta(Z))],$$

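and the corresponding suggested update in terms of the function $G_\theta$ is an update from $G_\theta$ to $(\mathrm{Id} - \alpha \nabla L) \circ G_\theta$. That is, our suggestion amounts to the update from $G_\theta\#\mu_z$ to $((\mathrm{Id} - \alpha \nabla L) \circ G_\theta)\#\mu_z$. Denoting

$$T_\alpha(x) := x - \alpha \nabla L(x), \tag{5}$$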

this is an update from $G_\theta\#\mu_z$ to $(T_\alpha \circ G_\theta)\#\mu_z$. This update of $G$ is called the functional gradient update, and $T_\alpha$ is called a transport function. With enough (reasonable) regularity assumptions on the function space and the probability space, one can justify the argument we have made above using differential calculus in an infinite-dimensional space of functions (see Appendix A). As a spoiler for what we will elaborate later, our method regularizes $L$ so that the target distribution $(T_\alpha \circ G_\theta)\#\mu_z$ has higher entropy than $\mu_\theta = G_\theta\#\mu_z$.
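The transport step can be tried out directly on samples. The sketch below moves particles drawn from a toy distribution by $T_\alpha(x) = x - \alpha \nabla L(x)$ and compares the sample mean of $L$ before and after the move. The function `L` used here is a hypothetical smooth stand-in for $L_g \circ F$, chosen so that its second derivative is bounded by 1.2; with $\alpha = 0.1$ the Taylor argument then applies at every point.

```python
import math
import random


def L(x):
    # Toy stand-in for L = L_g o F (hypothetical, smooth, |L''| <= 1.2).
    return math.sin(x) + 0.1 * x * x


def grad_L(x):
    return math.cos(x) + 0.2 * x


def transport(x, alpha):
    # T_alpha(x) = x - alpha * grad L(x), as in Eq. (5).
    return x - alpha * grad_L(x)


rng = random.Random(0)
alpha = 0.1
particles = [rng.gauss(0.0, 2.0) for _ in range(5000)]   # samples from the current distribution
moved = [transport(x, alpha) for x in particles]          # samples from the pushforward by T_alpha

mean_before = sum(L(x) for x in particles) / len(particles)
mean_after = sum(L(x) for x in moved) / len(moved)
print(mean_before, mean_after)
```

In this toy setting the mean of $L$ over the transported particles is lower than over the original ones, which is the content of relation (4).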

We are, however, still not done with our description of the functional gradient perspective of the GANs update. In what we formulated above, there is no guarantee that the target distribution $(T_\alpha \circ G_\theta)\#\mu_z$ admits the same parametric representation as $G_\theta$; if $(T_\alpha \circ G_\theta)\#\mu_z$ cannot be expressed by the DNN that we have prepared for the training, all suggestions we have made above are for naught. The final remaining task in this update procedure is therefore to find the parameter $\theta$ such that $G_\theta\#\mu_z$ best approximates the target distribution. Letting $\theta_{\mathrm{old}}$ denote the current choice of $\theta$ for the generator function, this can be done by solving the following optimization problem in $\theta$:

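$$\min_{\theta}\; \mathbb{E}_{\mu_z}\left[\left\| (T_\alpha \circ G_{\theta_{\mathrm{old}}})(Z) - G_\theta(Z) \right\|_2^2\right]. \tag{6}$$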

The gradient of this sub-objective function (6) with respect to $\theta$ evaluated at $\theta = \theta_{\mathrm{old}}$ is given by

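$$-2\,\mathbb{E}_{\mu_z}\left[\nabla_\theta G_\theta(Z)\,\big((T_\alpha \circ G_{\theta_{\mathrm{old}}})(Z) - G_\theta(Z)\big)\right]\Big|_{\theta=\theta_{\mathrm{old}}} \tag{7}$$

$$= 2\alpha\,\mathbb{E}_{\mu_z}\left[\nabla_\theta G_\theta(Z)\big|_{\theta=\theta_{\mathrm{old}}}\,\nabla L(G_{\theta_{\mathrm{old}}}(Z))\right] \tag{8}$$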

where $\nabla_\theta$ designates the derivative operator with respect to $\theta$. This formulation is practically equivalent to the one introduced in xICFG (Johnson & Zhang, 2018). It also turns out to be the familiar gradient update used in the usual GAN implementation. Indeed, in the usual implementation, the parameter for the generator is updated with the rule:

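$$\theta_{\mathrm{new}} = \theta_{\mathrm{old}} - 2\alpha\,\nabla_\theta \mathbb{E}_{\mu_z}[L(G(Z; \theta))]\Big|_{\theta=\theta_{\mathrm{old}}} \tag{9}$$

$$= \theta_{\mathrm{old}} - 2\alpha\,\mathbb{E}_{\mu_z}\left[\nabla_\theta G_\theta(Z)\big|_{\theta=\theta_{\mathrm{old}}}\,\nabla L(G(Z; \theta_{\mathrm{old}}))\right] \tag{10}$$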

In general, almost all variations of GANs to date (Arjovsky et al., 2017; Lim & Ye, 2017; Mao et al., 2017; Nowozin et al., 2016) use this type of update rule for the training of the generator. In other words, all methods to date have been implicitly doing the functional gradient-type update all along, and the choice of $L$ has been determining the choice of the target distribution. As hinted a moment ago, this suggests that, by directly regularizing $L$ in the usual update scheme of GANs, one can realize a functional gradient update of the generator with a controlled target distribution. This is the very gist of our algorithm.
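The equality between the gradient of the distillation objective (6) at $\theta_{\mathrm{old}}$ and the GAN-style gradient in (8) can be checked numerically. The sketch below does this for a hypothetical two-parameter generator $G_\theta(z) = \theta_0 + \theta_1 z$ and a hypothetical smooth $L$; both are assumptions made for illustration only. The left-hand side is obtained by central finite differences on the sample version of (6), the right-hand side by the closed-form expression.

```python
import math
import random


def grad_L(x):
    # Gradient of a toy L(x) = sin(x) + 0.1 x^2 (hypothetical stand-in for L_g o F).
    return math.cos(x) + 0.2 * x


def generator(theta, z):
    # Hypothetical two-parameter generator G_theta(z) = theta[0] + theta[1] * z.
    return theta[0] + theta[1] * z


def distill_objective(theta, theta_old, seeds, alpha):
    # Sample version of Eq. (6): mean squared distance to the transported points.
    total = 0.0
    for z in seeds:
        x_old = generator(theta_old, z)
        target = x_old - alpha * grad_L(x_old)      # (T_alpha o G_theta_old)(z)
        total += (target - generator(theta, z)) ** 2
    return total / len(seeds)


def numeric_grad(f, theta, eps=1e-5):
    # Central finite differences, one coordinate at a time.
    grads = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2.0 * eps))
    return grads


def gan_style_grad(theta_old, seeds, alpha):
    # Sample version of Eq. (8): 2 * alpha * E[ dG/dtheta * grad L(G(z)) ].
    g0 = g1 = 0.0
    for z in seeds:
        s = grad_L(generator(theta_old, z))
        g0 += s          # dG/dtheta[0] = 1
        g1 += s * z      # dG/dtheta[1] = z
    n = len(seeds)
    return [2.0 * alpha * g0 / n, 2.0 * alpha * g1 / n]


rng = random.Random(0)
seeds = [rng.gauss(0.0, 1.0) for _ in range(1000)]
theta_old, alpha = [0.3, 1.5], 0.05
lhs = numeric_grad(lambda th: distill_objective(th, theta_old, seeds, alpha), theta_old)
rhs = gan_style_grad(theta_old, seeds, alpha)
print(lhs, rhs)
```

The two gradients agree up to finite-difference rounding, so a gradient step on (6) from $\theta_{\mathrm{old}}$ points in the same direction as the usual generator update.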

2.3 CHOICE OF THE TARGET DISTRIBUTION

As suggested above, from the perspective of functional gradient, the min-max game of GANs can be decomposed into two steps: (i) the target construction step, which constructs the next target distribution using the functional gradient, and (ii) the distillation step, which looks for a neural function that better approximates the target distribution. Now, the next natural pressing question is: what type of $L$ should we use to make sure that the next target distribution is nice? As mentioned in the introduction, the goal of this particular study is to create a sequence of updates that is less likely to suffer from mode collapse. Let us recall that, as long as we follow the standard update procedure of GANs, the choice of $L$ entirely determines the properties of the next target distribution. The proposal we would like to make in this study is to simply choose the discriminator $L$ from the set of functions that are concave on the support of the current distribution:

Proposition 2.1. Let $\mu$ be a probability distribution on $\mathbb{R}^d$. If $L$ is concave on the support of $\mu$, then $H(T_\alpha\#\mu) \geq H(\mu)$.

Note that this statement is independent of the step size $\alpha$, because any positive scalar multiple of a concave function is concave. This result ensures that any update with a concave $L$ will always increase the entropy; that is, the target distribution will be more dispersed than the current distribution, and mode collapse is less likely to happen.
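One way to see the mechanism behind Proposition 2.1 in one dimension is the standard change-of-variables identity $H(T\#\mu) = H(\mu) + \mathbb{E}_\mu[\log |T'(X)|]$ for a smooth invertible map $T$. With $T_\alpha(x) = x - \alpha \nabla L(x)$ and concave $L$, the derivative $T_\alpha'(x) = 1 - \alpha L''(x)$ is at least 1, so the correction term is nonnegative. The sketch below estimates that term by Monte Carlo for the hypothetical concave choice $L(x) = -\log\cosh(x)$ and a standard Gaussian as the current distribution; both are assumptions for illustration and not the setting of the experiments.

```python
import math
import random


def hess_L(x):
    # L(x) = -log(cosh(x)) is concave on R: L''(x) = -1 / cosh(x)^2 <= 0.
    return -1.0 / math.cosh(x) ** 2


def entropy_gain(samples, alpha):
    # Monte Carlo estimate of E[log T_alpha'(X)], with T_alpha'(x) = 1 - alpha * L''(x).
    return sum(math.log(1.0 - alpha * hess_L(x)) for x in samples) / len(samples)


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # toy current distribution
gains = [entropy_gain(samples, a) for a in (0.01, 0.1, 1.0)]
print(gains)
```

Every estimated gain is positive, and it grows with the step size, in line with the statement that the sign of the entropy change does not depend on $\alpha$.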

Additionally, our choice of $L$ guarantees that the transport function used to create the target distribution satisfies another preferable property called monotonicity. Monotonicity is a property that requires that there be no crossings in the transport:

$$\langle T(x) - T(x'),\; x - x' \rangle_{\mathbb{R}^d} \geq 0.$$

This is a preferred property because, as we will empirically show (see Section 4.1), crossings in the transport tend to hinder the smooth training process. This is in fact somewhat intuitive, because crossings lead to wasteful transportation of the mass. In fact, this is a property that is achieved by the distributional update based on optimal transport as well (Villani, 2008). Indeed, by definition, the concavity of $L$ implies the strong convexity of $\|x\|_2^2/2 - \alpha L(x)$, which in turn implies

$$\langle T_\alpha(x) - T_\alpha(x'),\; x - x' \rangle_{\mathbb{R}^d} \geq \|x - x'\|_2^2 \geq 0, \tag{11}$$


which is a stronger condition than monotonicity. Thus, by simply making $L$ concave, we can not only construct a target distribution whose entropy is greater than that of the current generator distribution, but also ensure that the corresponding transport is monotonic. In a way, this is a statement about how complicated the function is, because the presence of crossings implies the existence of a discontinuity or a region of one-to-many mappings. This is therefore a property that is likely to affect the distillation step. In fact, we will empirically show that the property of monotonicity affects the distillation step in a positive way.
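Inequality (11) is easy to probe numerically in one dimension. The sketch below evaluates the margin $\langle T_\alpha(x) - T_\alpha(y), x - y\rangle - |x - y|^2$ over random pairs for two hypothetical choices of $L$: the concave $L(x) = -\log\cosh(x)$ and the convex $L(x) = 2x^2$. These functions and the step size $\alpha = 0.5$ are assumptions for illustration; with the convex choice the map becomes $T_\alpha(x) = -x$, which reverses the order of the points.

```python
import math
import random


def transport_concave(x, alpha):
    # L(x) = -log(cosh(x)) is concave, grad L(x) = -tanh(x), so T(x) = x + alpha * tanh(x).
    return x + alpha * math.tanh(x)


def transport_convex(x, alpha):
    # L(x) = 2 x^2 is convex, grad L(x) = 4 x, so T(x) = (1 - 4 * alpha) * x.
    return x - alpha * 4.0 * x


def min_margin(transport, pairs, alpha):
    # Smallest value of <T(x) - T(y), x - y> - |x - y|^2 over the sampled pairs.
    return min(
        (transport(x, alpha) - transport(y, alpha)) * (x - y) - (x - y) ** 2
        for x, y in pairs
    )


rng = random.Random(0)
pairs = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(5000)]
margin_concave = min_margin(transport_concave, pairs, alpha=0.5)
margin_convex = min_margin(transport_convex, pairs, alpha=0.5)
print(margin_concave, margin_convex)
```

For the concave choice the margin stays nonnegative up to rounding, while the convex choice violates even plain monotonicity, which corresponds to crossings in the transport.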

Of course, we do not intend to say that our suggestion for the property of $L$ is optimal; the user has the freedom to choose the set of properties to be required of $L$ so that the corresponding target distribution has the desired properties that serve the user's purpose.

2.4 RELATION TO OPTIMAL TRANSPORT

Our strategy for the distributional update that we have discussed so far has close relations to the optimal transport. In fact, an equation of the form (5) is ubiquitous in the theory of optimal transport. In this section, we will show the following:

1. By choosing $L$ to be concave, the transport $T_\alpha$ in equation (5) becomes the optimal transport from the current distribution $\mu$ to the target distribution $T_\alpha\#\mu$.
2. If the entropy of the true distribution $\nu$ is greater than that of the current distribution $\mu$, any sequence of updates of the distribution along the optimal transport from $\mu$ to $\nu$ increases the entropy monotonically. We can also ensure the monotonic increase in the entropy by using $T_\alpha$ with concave $L$.
3. By choosing $L$ to be concave, we can ensure that the target distribution constructed from $T_\alpha$ will always be within $\alpha$ distance of the current distribution in the Wasserstein ($W_2$) sense.

In order to elaborate on these points, we would like to briefly introduce the notion of optimal transport. For a more rigorous introduction of the concept, please consult Villani (2008). For $p \geq 1$, the $p$-Wasserstein distance $W_p(\mu, \nu)$ between two arbitrary distributions $\mu$ and $\nu$ with sufficient regularity is given by

$$W_p(\mu, \nu) = \left(\inf_{T}\; \mathbb{E}_{\mu}\left[\|X - T(X)\|_2^p\right]\right)^{1/p}, \tag{12}$$

where the inf above is taken over all measurable maps $T$ satisfying $T\#\mu = \nu$. When $p = 2$, it is known that the infimum of this cost function is achieved by a map $T^*$ that can be expressed as $T^*(x) = x - \nabla L^*(x)$ for some $L^*$ that renders $\|x\|^2/2 - L^*$ convex (Brenier, 1991). The search for $L^*$, and hence $T^*$, therefore amounts to the search for the closest coupling between $\mu$ and $\nu$ in the $L^2$ sense.

Another surprising fact is that the movement of the particles along the direction of $-\nabla L^*$ monotonically decreases $W_2$ (Villani, 2008). That is, if $T^*_t(x) = x - t\nabla L^*(x)$, then $W_2(T^*_t\#\mu, \nu)$ monotonically decreases with $t$ (Villani, 2003). Please compare this transport equation with equation (5). This $T^*_t$ is indeed the optimal transportation analogue of the functional gradient. The theorem in Brenier (1991) has an even more surprising converse: any function $T$ that can be written as a gradient of a strictly convex function turns out to be the unique optimal transport from $\mu$ to $T\#\mu$. Using this fact, we can appeal to Proposition 2.2 below to claim that the functional update based on the optimal transport satisfies the following important property, which should intuitively be fulfilled if we are in fact moving the distribution toward the true distribution: if the current distribution has lower entropy than the true distribution, the sequence of the distributions produced by the optimal transport-based updates should be monotonically increasing in the entropy:

Proposition 2.2. Suppose $\nu = T\#\mu$ for some $T$ that can be written as a gradient of a strictly convex function. If $H(\nu) \geq H(\mu)$, then $H(T_t\#\mu)$ is monotonically increasing on $t \in [0, 1]$.
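For one-dimensional Gaussians, the behavior described in Proposition 2.2 can be written in closed form. The sketch below assumes that $T_t$ denotes the interpolation $T_t = (1 - t)\,\mathrm{Id} + t\,T$, which agrees with the form $x - t\nabla L^*(x)$ used earlier in this section, and takes $\mu = N(0, \sigma_0^2)$ and $\nu = N(m, \sigma_1^2)$ with $\sigma_1 > \sigma_0$. The map $T(x) = m + (\sigma_1/\sigma_0)x$ is the derivative of a strictly convex quadratic, and the specific numbers are hypothetical.

```python
import math


def gaussian_entropy(sigma):
    # Differential entropy of a 1-D Gaussian with standard deviation sigma.
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def interpolate(t, sigma0, mean1, sigma1):
    # T_t = (1 - t) * Id + t * T with T(x) = mean1 + (sigma1 / sigma0) * x
    # pushes N(0, sigma0^2) to N(t * mean1, ((1 - t) * sigma0 + t * sigma1)^2).
    return t * mean1, (1.0 - t) * sigma0 + t * sigma1


def w2_gaussians(m_a, s_a, m_b, s_b):
    # Closed-form 2-Wasserstein distance between 1-D Gaussians.
    return math.sqrt((m_a - m_b) ** 2 + (s_a - s_b) ** 2)


sigma0, mean1, sigma1 = 0.5, 3.0, 2.0      # mu = N(0, 0.25), nu = N(3, 4)
ts = [i / 10.0 for i in range(11)]
entropies, distances = [], []
for t in ts:
    m_t, s_t = interpolate(t, sigma0, mean1, sigma1)
    entropies.append(gaussian_entropy(s_t))
    distances.append(w2_gaussians(m_t, s_t, mean1, sigma1))
print(entropies)
print(distances)
```

Along this path the entropy increases with $t$ while the $W_2$ distance to $\nu$ shrinks to zero, which is the pattern the proposition describes for the general case.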

This proposition actually ensures that the updates of DC regularization also satisfy the same property. Because we are designing the target distribution so that it will have higher entropy than the current distribution at every step, by letting $\nu$ in Proposition 2.2 be the target distribution, we have a guarantee that the sequence of distributions constructed by our updates is also monotonically increasing in the entropy. To see this, simply note that we can automatically make $\|x\|^2/2 - L$ convex by making $L$ concave.


The connection between our DC regularization and optimal transport is not limited to the property we introduced above. By the theory of Brenier, the choice $T_\alpha = \mathrm{Id} - \alpha \nabla L$ with concave and 1-Lipschitz $L$ satisfies $W_2(\mu, T_\alpha\#\mu) = \sqrt{\mathbb{E}[\|\alpha \nabla L(X)\|_2^2]} = \alpha$. Put in still other words, by constructing a target distribution with concave $L$, we can also ensure that the target distribution is contained within the $W_2$ neighborhood of the current distribution function. This is a property that cannot be guaranteed with the conventional parameter-based updates. The monotonicity condition is also a property that is satisfied by the optimal transport. According to the theory of Monge-Ampère (Villani, 2008), a transport map can be optimal only if it is monotonic. Thus, while not being exactly based on the optimal transport from the generator distribution to the true distribution, our $T$ shares many favorable properties with the optimal transport, and is very closely related to the theory of $W_2$ distance.

Reflecting on these facts, one might be tempted to say that we should simply formulate GANs with an objective function that is solely based on the $W_2$ distance between the generator distribution and the true distribution. However, as we will further articulate in the discussion section, conducting faithful $W_2$-based updates in the implementation of GANs remains a challenge.

If $\mu_\theta := G_\theta\#\mu_z$ is our current parametric generator distribution and $\nu$ is the true distribution, our DC regularization method first (i) proposes a target distribution $T\#\mu_\theta$ using a concave $L$, and then (ii) seeks a measure $\mu_{\theta_{\mathrm{new}}}$ that approximates the target distribution $T\#\mu_\theta$ well. We use the following sampling-based penalty term in order to promote the concavity of $L$ $(= L_g \circ F)$:

$$L_{dc}(F, \epsilon, x_1, x_2, d) = \max\{L(\epsilon x_1 + (1 - \epsilon)x_2) - \epsilon L(x_1) - (1 - \epsilon)L(x_2),\; d\}, \tag{13}$$

where $x_1, x_2$ are samples from the support of $\mu_\theta$, $\epsilon$ is a sample from the uniform distribution over $[0, 1]$, and $d$ is a positive scalar. Note that this term must be positive if $L$ is concave over the support of $\mu_\theta$. Our algorithm is summarized in Algorithm 1. The update rule in this algorithm uses the target distribution constructed by $T(x) = x - \nabla L(x)$ with concave $L$. Intuitively speaking, the transport $T$ has the effect of moving $\mu_\theta$ toward $T\#\mu_\theta$ while dispersing the mass of $\mu_\theta$. See Fig 1 for a visual rendering of this interpretation.
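The quantity inside the penalty (13) is a sampled concavity gap, and its sign behavior can be checked on toy functions. The sketch below implements only the unclipped gap $L(\epsilon x_1 + (1 - \epsilon)x_2) - \epsilon L(x_1) - (1 - \epsilon)L(x_2)$ and its Monte Carlo average over $\epsilon \sim \mathrm{Uniform}[0, 1]$ and pairs drawn from a sample set; the clipping against $d$ is deliberately left out. The helper names, the one-dimensional setting, and the two test functions are assumptions made for illustration.

```python
import random


def concavity_gap(L, eps, x1, x2):
    # L(eps * x1 + (1 - eps) * x2) - eps * L(x1) - (1 - eps) * L(x2);
    # nonnegative for every (eps, x1, x2) when L is concave.
    return L(eps * x1 + (1.0 - eps) * x2) - eps * L(x1) - (1.0 - eps) * L(x2)


def mean_gap(L, support_samples, n_draws, rng):
    # Monte Carlo average over eps ~ Uniform[0, 1] and pairs drawn from the samples.
    total = 0.0
    for _ in range(n_draws):
        eps = rng.random()
        x1, x2 = rng.choice(support_samples), rng.choice(support_samples)
        total += concavity_gap(L, eps, x1, x2)
    return total / n_draws


rng = random.Random(0)
support = [rng.gauss(0.0, 1.0) for _ in range(500)]   # toy samples from the generator
gap_concave = mean_gap(lambda x: -x * x, support, 2000, rng)
gap_convex = mean_gap(lambda x: x * x, support, 2000, rng)
print(gap_concave, gap_convex)
```

The average gap is positive for the concave function and negative for the convex one, so rewarding a large gap while training the critic pushes $L$ toward concavity on the sampled region.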

In most practical applications, the training of GANs begins by dispersing an initial distribution with small entropy and gradually molds the mass into what resembles the true distribution. Our transport ensures that this dispersion happens consistently over the course of the training. As we will show in the results section, this regularization works in favor of the inception score without any noticeable downside from over-dispersion, suggesting that the generator distributions created with the current state-of-the-art techniques are still not dispersed enough.

Algorithm 1: GANs algorithm with DC regularization

for each iteration do

1. $\theta_{\mathrm{old}} \leftarrow \theta$
2. The target construction step: find $F$ that optimizes

$$\max_{F}\; V(G, F) + \lambda\, \mathbb{E}_{\mathrm{Uniform}(\epsilon)} \mathbb{E}_{\mu_{\theta_{\mathrm{old}}}(X_1, X_2)}\left[L_{dc}(F, \epsilon, X_1, X_2, d)\right] \tag{14}$$

and construct $T_\alpha$ from $L := L_g \circ F$ as in equation (5).

3. The distillation step: update the generator with the generator's objective

$$\min_{\theta}\; \mathbb{E}_{\mu_z}\left[\left\| (T_\alpha \circ G_{\theta_{\mathrm{old}}})(Z) - G_\theta(Z) \right\|_2^2\right] \tag{15}$$

4 EXPERIMENTAL RESULTS

We applied DC regularization to the training of GANs on CIFAR-10 (Torralba et al., 2008), CIFAR-100 (Torralba et al., 2008), and the ILSVRC2012 dataset (ImageNet) (Russakovsky et al., 2015) in various settings and evaluated its performance in terms of Inception score (Salimans et al., 2016) and Fréchet inception distance (FID) (Heusel et al., 2017). Inception score and FID are performance measures that are commonly used to evaluate the severity of mode collapse. For the details of the evaluation, see Appendix C.1. We also conducted an additional set of experiments with an artificial dataset to investigate the properties of DC regularization. We provide the results of additional experiments in Appendix D as well.

Fig 1: The graph of $L$ (loss surface) and its gradient vector field (gradient of loss). Panels: (a) L is not convex; (b) L is convex. Each point $x$ will be transported by $T$ along the direction of the vector field determined by $\nabla L$. Over the set on which $L$ is not concave, $T$ may move the points in the region closer to each other. The use of a concavified $L$ for the construction of the transport function will make the points move away from each other.

4.1 INTRINSIC PROPERTIES OF DC REGULARIZATION

Effect of DC regularization on entropy. Using a simple Gaussian Mixture Model with five modes as the true distribution, we evaluated the ability of DC regularization to promote the entropy of the generator distribution. We used the hinge loss for the objective function, and used DNNs to model both the discriminator and the generator. We trained the model with and without DC regularization and reported the entropy of the generator at different stages of the training by explicitly computing the determinant of the Jacobian at each layer. As one of the baselines, we trained EGAN-Ent-VI (Dai et al., 2017), which includes in its objective function a penalty against the negative entropy of the generator. The result is illustrated in Fig 2 (a). We see that the regularization positively affects the entropy at all stages of the training. Although to a lesser extent, we can confirm that our implementation of EGAN-Ent-VI also prevents the degeneration of the entropy. Indeed, the persistent pressure to increase the entropy can result in an over-dispersed final product. However, this does not seem to be a serious issue when it comes to learning on big data like ImageNet. In terms of Inception score and FID score, all artificial generator distributions today are still far less diverse than the original dataset (Table 1), and there is still much room left for the improvement of diversity.

Effect of monotonicity in distillation step. Recall that each round of the GANs update consists of the target construction step and the distillation step. We conducted an experiment to verify the effect of monotonicity on the distillation step. We will show a case in which, even though the target distribution constructed from a non-monotonic map is closer to the current distribution than the target distribution constructed from a monotonic map, the projection of the latter distribution onto the parametric function space is much easier. Let $K$ be a positive value. Starting from an initial distribution $X = G_{\theta_0}(Z)$, consider the following pair of maps to be applied to $X$:

$$T^{(m)}(x) = (x - K)\,\mathbf{1}_{(-1)^m x < 0} + (x + K)\,\mathbf{1}_{(-1)^m x \geq 0} \tag{16}$$

It is evident that $T^{(2)}$ is a monotonic mapping and $T^{(1)}$ is not. Denote the law of $X$ by $\mu$. We used $T^{(1)}\#\mu$ and $T^{(2)}\#\mu$ as target distributions, and trained the parameter $\theta$ using $O_m = \mathbb{E}_Z[\|T^{(m)}(G_{\theta_0}(Z)) - G_\theta(Z)\|^2]$ as the objective function (distillation). Note that, at $\theta = \theta_0$, this objective function takes the same value $K^2$ for both $m = 1, 2$. Also, by the definition of the Wasserstein distance, $K = W_2(T^{(2)}\#\mu, \mu)$ while $W_2(T^{(1)}\#\mu, \mu)^2 = \inf_{T\#\mu = T^{(1)}\#\mu} \mathbb{E}_\mu[\|T(X) - X\|^2] \leq \mathbb{E}_\mu[\|T^{(1)}(X) - X\|^2] = K^2$. A naive intuition dictates that the distillation of $T^{(1)}\#\mu$ is easier, because $T^{(1)}\#\mu$ is distributionally closer to $\mu$ while the objective function evaluated at $T^{(1)}\#\mu$ is the same as the evaluation at $T^{(2)}\#\mu$. However, as we can see in Fig 2 (b), the training with $O_2$ proceeds much faster than the training with $O_1$. This seemingly counterintuitive observation can be supported by the theory of optimal transport. Note that $T^{(1)}$ is the derivative of $x^2/2 - K|x|$ and $T^{(2)}$ is the derivative of $x^2/2 + K|x|$, and the latter is a convex function. This implies that $T^{(2)}$ is the optimal transport from $\mu$ to $T^{(2)}\#\mu$. This result (Fig 2 (b)) suggests that, when using an update rule similar to equation (10), the training proceeds much faster when we choose a target distribution with a monotonic mapping.
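The geometric claims about the two maps can be verified on samples without training any network. The sketch below reads Eq. (16) as a shift by $-K$ where $(-1)^m x < 0$ and by $+K$ elsewhere, takes $\mu$ to be a standard Gaussian with $K = 1$, and uses the fact that in one dimension the optimal coupling of two equal-size samples pairs their sorted values. The inequality directions in the indicator functions, the Gaussian choice, and the helper names are assumptions made for this example.

```python
import math
import random


def transport(x, m, K):
    # Eq. (16): shift by -K where (-1)^m * x < 0 and by +K elsewhere.
    return x - K if ((-1) ** m) * x < 0 else x + K


def mean_sq_displacement(xs, m, K):
    # Sample version of E[(T(X) - X)^2], the distillation objective at theta_0.
    return sum((transport(x, m, K) - x) ** 2 for x in xs) / len(xs)


def is_order_preserving(xs, m, K):
    # A 1-D map is monotonic on the sample iff it keeps sorted inputs sorted.
    ys = [transport(x, m, K) for x in sorted(xs)]
    return all(a <= b for a, b in zip(ys, ys[1:]))


def empirical_w2(xs, ys):
    # In 1-D the optimal coupling of two equal-size samples matches sorted values.
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(xs))


rng = random.Random(0)
K = 1.0
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]          # samples from the initial law
cost = {m: mean_sq_displacement(xs, m, K) for m in (1, 2)}
monotone = {m: is_order_preserving(xs, m, K) for m in (1, 2)}
w2 = {m: empirical_w2([transport(x, m, K) for x in xs], xs) for m in (1, 2)}
print(cost, monotone, w2)
```

Both maps give the same mean squared displacement $K^2$, only the second one preserves order, and the empirical $W_2$ distance to the initial sample equals $K$ for the monotonic map while it is smaller for the crossing one.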


Fig 2: (a) Effect on entropy: entropy plotted against iteration for the baseline, for DC regularization started at iteration 0 (lam = 1.0), at iteration 0 (lam = 10.0), at iteration 2000 (lam = 10.0), and at iteration 3000 (lam = 10.0), and for EGAN. (b) Effect of monotonicity: mean square error plotted against iteration for the monotone and cross-over maps. The left graph plots the transition of the entropy of the generator distribution of GANs trained on the mixture of Gaussians. The marker "k-th iteration start" designates the result for which our regularization was applied after the k-th step. We see that our method has the effect of preventing entropy degeneration. The right graph plots the transitions of the energy for the distillation step when the target distributions were constructed by applying a monotonic map (red) and a non-monotonic map (blue) to a Gaussian distribution. Notice that the distillation step converges faster when the target distribution is constructed from a monotonic map.

4.2 RESULTS ON CIFAR-10 AND CIFAR-100

Experiments with different architectures and objective functions. We tested our algorithm for the training of GANs with six types of objective functions and two network architectures, and reported the performance on all 12 combinations. For the details of the objective functions and the architectures, please see Appendix C.2, C.3, and C.4. For the training of the network, we applied Spectral Normalization (SN) (Miyato et al., 2018) to the fully connected layers and the convolution layers. Fig 3 and Fig 8 (in Appendix) respectively summarize Inception scores and FID for all 12 settings. We see that DC regularization improves the performance irrespective of the choice of the architecture and the objective function.

Experiments with different prior dimensions. In general, without careful selection of the dimension of the prior distribution, the training of GANs tends to suffer from a serious case of mode collapse. For image generation tasks, this results in a low inception score. We therefore applied DC regularization to the training of GANs on CIFAR-10 with very low prior dimensions as well as very large prior dimensions and evaluated the performance. For this experiment, we used GAN-variant2 (Appendix C.2) for the objective function and SNDCGAN for the architecture. For the experimental details, please see Appendix C.4. The results are summarized in Fig 4 and Fig 10 (in Appendix). When the prior dimension is below the inherent dimension of the dataset, the dimension of the generator distribution cannot match the dimension of the true distribution. Thus, the inception score of the generator trained with a low-dimensional prior is bound to be low, irrespective of the application of DC regularization. However, for $\dim(z) \geq 5$, the inception score is as high as 7.5, and for all choices of $\dim(z)$, the generator trained with DC regularization consistently outperformed the generator trained without DC regularization. A similar argument applies to the performance of DC regularization for high-dimensional priors. Overall, DC regularization provides some robustness against the choice of the prior dimension.

[Figure 3: bar charts of the Inception score for SNDCGAN, SNDCGAN+DC-reg, SNResNet and SNResNet+DC-reg under the objectives GAN-vanilla, GAN-variant1, GAN-variant2, GAN-hinge, WGAN-GP and LSGAN. Panels: (a) CIFAR-10, (b) CIFAR-100.]

Fig 3: The Inception scores of different GAN methods on CIFAR-10 and CIFAR-100 (higher is better). The right group of bars in each graph represents the set of scores achieved by the implementations with our regularization. The left group in each graph represents the set of scores achieved by the implementations without the regularization. Our regularization improves the Inception score for all methods.


[Figure 4: Inception score plotted against the dimension of Z for the baseline and the proposal. Panels: (a) low prior dimension, (b) high prior dimension.]

Fig 4: The performance of the DCGAN in terms of the Inception score for CIFAR-10 plotted against the dimension of µz (standard Gaussian). The baseline is the DCGAN with spectral normalization (SN). We can see that too low a dimension and too high a dimension both negatively affect the performance. Notice that the DCGAN with DC regularization outperforms the baseline DCGAN for all extreme dimensions.

Comparison with EGAN and other methods We compared the performance of our algorithm against EGAN-Ent-VI (Dai et al., 2017), another framework that can be used to control the entropy of the generator. We conducted this comparative study using SNDCGAN (Table 2) and SNResNet (Table 3), both of which use Spectral Normalization, which is known to prevent the degeneration of the feature space. SNDCGAN is known to perform well on CIFAR-10 on its own. For the experimental setting of this study, please see Appendix C.4. As we show in Table 1, DC regularization outperformed EGAN-Ent-VI on both models. We reported the result of EGAN-Ent-VI with spectral normalization because it worked better than the version without SN. As we can see in Table 1, EGAN with SN performed worse than the vanilla SNDCGAN. For a high dimensional dataset like CIFAR-10, the variational inference for the negative entropy can be extremely difficult. It is possible that poor variational inference in the high dimensional space hurt the performance of EGAN-Ent-VI on CIFAR-10. We would also like to emphasize that EGAN-Ent-VI requires a separate decoder in addition to the generator and the discriminator, and that our algorithm is easier to implement. For the full version of the table and the visuals of the generated samples, please see Table 8 and Fig 11 in the Appendix. We compared our algorithm against other competitive methods as well. The best performance of our method is almost on par with the state-of-the-art method, progressive GAN (Karras et al., 2018), to which we lose by a slight margin; we would like to note, however, that we use a much smaller architecture than Karras et al. (2018) for the performance evaluation of our method. We also confirmed that we can improve the result of Miyato et al. (2018) by using DC regularization together with SN.
The results support that our method helps the training process suppress mode collapse and improves the overall performance. For the results of the experiments conducted with Wasserstein GAN with gradient penalty (WGAN-GP) (Gulrajani et al., 2017), please see Fig 9 in the Appendix.

4.3 RESULTS ON IMAGENET

Additionally, to evaluate our method's effectiveness on a higher dimensional dataset, we applied our method to ImageNet with 1000 classes, with each class containing approximately 1300 images. We compressed the images to 64 × 64 pixels prior to the experiments.

[Figure 5: Inception score plotted against the training epoch (0 to 30) for the baseline and the proposal.]

Fig 5: Performance of DC regularization on ImageNet. The baseline used in the comparison here is the GAN implemented with SN and the hinge loss. The training with DC regularization achieves a consistently higher score for all epochs.


For the objective function, we chose GAN-hinge (Appendix C.2), which was used in Miyato et al. (2018). For the details of the experimental settings, please see Appendix C.5. We can confirm in Fig 5 that our DC regularization improves the Inception score throughout the course of the training.

Table 1: Inception scores and FIDs for unsupervised image generation on CIFAR-10 and CIFAR-100. The CIFAR-10 results for the models designated with † are cited from (Miyato et al., 2018), and the CIFAR-10 results with ‡ are cited from (Karras et al., 2018).

| Method | Inception score (CIFAR-10) | Inception score (CIFAR-100) | FID (CIFAR-10) | FID (CIFAR-100) |
|---|---|---|---|---|
| Real data | 11.24 | 14.79 | 7.6 | 8.94 |
| EGAN-Ent-VI (SNDCGAN) | 6.95 ± .08 | 6.62 ± .10 | 29.0 | 33.3 |
| EGAN-Ent-VI (SNResNet) | 7.31 ± .12 | 6.67 ± .10 | 27.0 | 30.5 |
| proposal | | | | |
| SNDCGAN + DC reg | 8.08 ± .12 | 8.12 ± .11 | 24.6 | 25.8 |
| SNResNet + DC reg | 8.27 ± .08 | 8.27 ± .13 | 24.3 | 24.6 |
| SNResNet Large + DC reg | 8.41 ± .10 | 8.20 ± .08 | 20.6 | 24.8 |
| SNResNet Large-hinge + DC reg | 8.29 ± .09 | 8.41 ± .11 | 19.5 | 23.6 |
| baseline | | | | |
| SNDCGAN-hinge† | 7.58 ± .12 | 7.57 ± .07 | 25.5 | 28.1 |
| SNResNet Large-hinge† | 8.22 ± .05 | 7.54 ± .13 | 21.7 | 26.6 |
| Progressive GANs‡ | 8.56 ± .06 | | | |

5 DISCUSSION

Dilemma of GANs update As we have shown above, the experimental results support our claim that DC regularization promotes the entropy of the generator distribution and stabilizes the training process of GANs. As mentioned briefly in Section 2, however, we do not have a guarantee that our update rule consistently reduces the $W_2$ distance between the generator distribution and the true distribution. This shortcoming, however, is common to almost all GAN algorithms today. As we have shown, the conventional update rule implicitly targets a distribution of the form $T_\#\mu$, where $T = \mathrm{Id} - \alpha\nabla L$ for some $\alpha$ and a discriminator $L$. As it turns out, the optimal transport takes this form only when we are optimizing the $W_2$ distance, while the common constraint $\|\nabla L\| = 1$ is a condition required of the dual potential when $p = 1$ ($W_1$) (Villani, 2008). In other words, conventional GANs construct the discriminator with the $W_1$ criterion and update the generator with the $W_2$ criterion. If we are to faithfully create the discriminator with the $W_2$ criterion, we must look for a Legendre pair of dual potential functions. Conversely, if we are to update the generator with the $W_1$ criterion, we must look for a closed form solution of the $W_1$ transport. This, however, is in general a highly complex mathematical problem for which there is a separate field of study (Santambrogio, 2015). To the authors' best knowledge, no study has provided a solid solution to this dilemma.

Convexity vs Strong convexity We would also like to mention that our regularization asks for more than what is required by the $W_2$ theory. As mentioned above, in order for $T$ to be the optimal transport from $\mu$ to $T_\#\mu$, $T$ only needs to be the gradient of a convex function. On the other hand, by asking $L$ to be concave, we are in fact asking $T$ to be the gradient of a strongly convex function. This excess is intentional. Recall that the parameter $\alpha$ in the transport $T_\alpha = \mathrm{Id} - \alpha\nabla L$ corresponds to a step size in the update of the generator. If we train an $L$ that only guarantees the convexity of $\|x\|^2/2 - L(x)$, the functional update derived from such an $L$ can be non-monotonic when the step size is large, because such an $L$ only guarantees the convexity of $\|x\|^2/2 - \alpha L(x)$ when $\alpha \le 1$. Should we require $L$ to be concave, however, $\|x\|^2/2 - \alpha L(x)$ is strongly convex for any positive $\alpha$. In this context, we may therefore say that DC regularization leaves much freedom in the choice of the learning schedule. Finally, as can be inferred from our proofs of Propositions 2.1 and 2.2, the strong convexity of the potential underlying $T$ has the effect of diffusing the mass of the pdf in every direction. We can therefore expect DC regularization to disperse the collapsed masses in the case of mode collapse. In light of the fact that the training of GANs usually begins with a dense distribution with small support, this dispersion effect should be helpful at the early stage of the training as well.
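The step-size argument above can be illustrated with a small numerical sketch. It assumes a hypothetical quadratic potential $L(x) = \sum_k c_k x_k^2/2$ (not taken from the paper), whose Hessian is $\mathrm{diag}(c_k)$, so that the Hessian of $\|x\|^2/2 - \alpha L(x)$ has eigenvalues $1 - \alpha c_k$. The function names and the numerical values are illustrative only.

```python
def hessian_eigs(curvatures, alpha):
    """Eigenvalues of the Hessian of |x|^2/2 - alpha*L(x) for the
    illustrative potential L(x) = sum_k c_k x_k^2 / 2, whose Hessian is
    diag(c_k). The combined Hessian is I - alpha * diag(c_k)."""
    return [1.0 - alpha * c for c in curvatures]


def is_convex(curvatures, alpha):
    # Convex iff every eigenvalue of the Hessian is non-negative.
    return all(e >= 0.0 for e in hessian_eigs(curvatures, alpha))


concave_L = [-2.0, -0.5]  # all c_k <= 0: L is concave
weak_L = [0.8, -0.5]      # only c_k <= 1: |x|^2/2 - L is convex, L is not concave

for alpha in (0.5, 1.0, 2.0, 10.0):
    print(alpha, is_convex(concave_L, alpha), is_convex(weak_L, alpha))
```

In this toy setting the concave potential keeps every eigenvalue at or above 1 for all tested step sizes, while the merely "convex after subtraction" potential loses convexity once $\alpha$ exceeds $1/0.8$.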


ACKNOWLEDGMENTS

We would like to thank the members of PFN Inc., particularly Daisuke Okanohara, Kouhei Hayashi, Masaki Watanabe, Shin-ichi Maeda, Sosuke Kobayashi, Takeru Miyato, Kenta Oono, and Toshiki Kataoka, for helpful comments and advice. We would also like to thank Atsushi Nitanda for constructive advice.

REFERENCES

Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, pp. 214–223, 2017.

Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.

Theophilos Cacoullos. Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics, 18(1):179–189, 1966.

Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Calibrating energy-based generative adversarial networks. In ICLR, 2017.

DC Dowson and BV Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982.

Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In NIPS, pp. 5769–5779, 2017.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In NIPS, pp. 6629–6640, 2017.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pp. 5967–5976, 2017.

Rie Johnson and Tong Zhang. Composite functional gradient learning of generative adversarial models. In ICML, pp. 2376–2384, 2018.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Jae Hyun Lim and Jong Chul Ye. Geometric GAN. arXiv preprint arXiv:1705.02894, 2017.

Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. PacGAN: The power of two samples in generative adversarial networks. arXiv preprint arXiv:1712.04086, 2017.

Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In ICCV, pp. 2813–2821, 2017.

Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.


Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.

Atsushi Nitanda and Taiji Suzuki. Gradient layer: Enhancing the convergence of adversarial training for generative models. In AISTATS, pp. 1008–1016, 2018.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, pp. 271–279, 2016.

Arkadas Ozakin and Alexander G Gray. Submanifold density estimation. In NIPS, pp. 1375–1382, 2009.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In ICCV, pp. 2849–2858, 2017.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, pp. 2226–2234, 2016.

Filippo Santambrogio. Optimal transport for applied mathematicians. Birkhäuser, NY, pp. 99–102, 2015.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pp. 1–9, 2015.

Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.

Cédric Villani. Topics in Optimal Transportation. Graduate Studies in Mathematics. American Mathematical Society, 2003.

Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pp. 2242–2251, 2017.


Appendix (Supplemental Material for Distributional Concavity Regularization for GANs)

A MORE MATHEMATICAL RENDITION OF SECTION 2

In this section, we re-explain Section 2 in more mathematical detail. For ease of argument, we will assume the following from here onward. First, as is the case in many applications of GANs, we will assume that $\nu$ is a probability measure on the Euclidean space $\mathbb{R}^d$, and $\mu_z$ is a probability measure on some Euclidean space $U$. Let us also assume that $G_\theta : U \to \mathbb{R}^d$ is a parametric function of $\theta$ that is almost everywhere differentiable with respect to both its input and $\theta$. We will first consider the optimization problem of equation (3) with gradient descent on $\theta$. For ease of notation, let us write $L = L_g \circ F$, where the operator $\circ$ designates function composition. We will assume that $L$ is a continuously differentiable and uniformly bounded function on $\mathbb{R}^d$, and that the Euclidean norm of its gradient and the operator norm of its Hessian are also uniformly bounded. Regularity assumptions of this type on the objective function are used elsewhere in the literature as well (Nitanda & Suzuki (2018), Johnson & Zhang (2018)), and they are not at all unrealistic, because the supports of both the target distribution and the generator distribution are often compact throughout the course of the training in real applications. We will also assume that the derivatives of all functions to appear are well defined and bounded on $\mathbb{R}^d$. The update rule of $\theta$ is given by:

$$\theta_{\mathrm{new}} = \theta_{\mathrm{old}} - 2\alpha \nabla_\theta E_{\mu_z}[L(G(Z;\theta))]\big|_{\theta=\theta_{\mathrm{old}}} \tag{17}$$

$$= \theta_{\mathrm{old}} - 2\alpha E_{\mu_z}\left[\nabla_\theta G_\theta(Z)\big|_{\theta=\theta_{\mathrm{old}}} \nabla L(G(Z;\theta_{\mathrm{old}}))\right] \tag{18}$$

where $\nabla_\theta$ designates the derivative operator with respect to $\theta$. In general, almost all variations of GANs to date (Arjovsky et al., 2017; Lim & Ye, 2017; Mao et al., 2017; Nowozin et al., 2016) use this type of update rule for the training of the generator. Let us denote the law of $G_\theta(Z)$ (or $X$) by $\mu_\theta$. Let $L^2(\mu_\theta)$ denote the Hilbert space of $L^2$ integrable maps from $\mathbb{R}^d$ to itself, equipped with the inner product $\langle a, b\rangle_{\mu_\theta} = E_{\mu_\theta}[a(X)^T b(X)]$. We will show that we can re-derive equation (18) using the theory of functional gradients. To begin with, instead of updating the parameter $\theta$, we would like to consider directly updating the distribution $\mu_\theta$ by applying a map $T \in L^2(\mu_\theta)$ to a random variable generated from $\mu_\theta$. This will result in a new distribution $T_\#\mu_\theta$, which is the unique distribution that satisfies $E_{\mu_\theta}[h(T(X))] = E_{T_\#\mu_\theta}[h(X)]$ for all measurable $h$.

Now, in order to create an update rule for $T$, let us consider the Gâteaux derivative of $M(T) := E_{T_\#\mu_\theta}[L(X)]$ with respect to $T$ in an arbitrary direction $\delta \in L^2(\mu_\theta)$. In short, we would like to consider the effect of perturbing $T$ in the direction of $\delta$. By the compactness assumption on $\Omega$ together with the regularity assumptions on the functions defined above, we can appeal to the dominated convergence theorem and see that

$$\lim_{\epsilon \to 0} \frac{M(T + \epsilon\delta) - M(T)}{\epsilon} = E_{\mu_\theta}\left[\nabla L(T(X))^T \delta(X)\right] \tag{19}$$

and that the convergence is uniform for all $\delta \in L^2(\mu_\theta)$. This way, the term $\nabla L(T(\cdot))$ in the expression above is a Fréchet derivative, that is, the type of derivative to which the usual set of derivative rules can be applied. We will refer to this derivative as the functional derivative for short. Note that the directional functional derivative $E_{\mu_\theta}[\nabla L(T(X))^T \delta(X)]$ will always take a positive value when $\delta \propto \nabla L(T(\cdot))$. Thus, for $\alpha$ small enough, an update from the current choice of $T$ to $T - \alpha\nabla L \in L^2(\mu_\theta)$ will decrease the objective function. Here, we are interested in the derivative computed at $T = \mathrm{Id}$, so the transformation we would like to apply to $\mu_\theta$ is

$$T_\alpha(x) = x - \alpha\nabla L(x). \tag{20}$$

This is indeed a transportation of mass in the direction of $-\alpha\nabla L$. Now, for the update of $\theta$, we would like to design $\theta_{\mathrm{new}}$ such that $\mu_{\theta_{\mathrm{new}}}$ is closer to the target distribution $T_\#\mu_{\theta_{\mathrm{old}}}$ than $\mu_{\theta_{\mathrm{old}}}$ is. That is, we would like to choose the $\theta$ that solves

$$\min_\theta E_{\mu_z}\left[\left\|(T_\alpha \circ G_{\theta_{\mathrm{old}}})(Z) - G_\theta(Z)\right\|_2^2\right]. \tag{21}$$

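The gradient of this sub-objective function with respect to $\theta$ is given by

$$-2E_{\mu_z}\left[\nabla_\theta G_\theta(Z)\left((T_\alpha \circ G_{\theta_{\mathrm{old}}})(Z) - G_\theta(Z)\right)\right]\Big|_{\theta=\theta_{\mathrm{old}}} \tag{22}$$

$$= 2\alpha E_{\mu_z}\left[\nabla_\theta G_\theta(Z)\big|_{\theta=\theta_{\mathrm{old}}} \nabla L(G_{\theta_{\mathrm{old}}}(Z))\right]. \tag{23}$$

Evaluating this at $\theta = \theta_{\mathrm{old}}$, we recover the gradient used in equation (18). This formulation is practically equivalent to the one introduced in xICFG (Johnson & Zhang, 2018).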
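The equality between the gradient of the sub-objective (21) and the expression (23) can be checked numerically in a toy setting. The sketch below assumes a hypothetical scalar generator $G_\theta(z) = \theta z$ and a hypothetical potential $L(x) = x^2/2$ (so that $\nabla L(x) = x$); neither is taken from the paper, and the expectation over $\mu_z$ is replaced by an average over three fixed samples.

```python
def grad_L(x):
    # Illustrative potential L(x) = x^2 / 2, so that grad L(x) = x.
    return x


def G(z, theta):
    # Illustrative scalar generator G_theta(z) = theta * z; dG/dtheta = z.
    return theta * z


def sub_objective(theta, theta_old, alpha, zs):
    # Sample average of (21): |T_alpha(G_old(z)) - G_theta(z)|^2.
    total = 0.0
    for z in zs:
        x_old = G(z, theta_old)
        target = x_old - alpha * grad_L(x_old)  # T_alpha(x) = x - alpha * grad L(x)
        total += (target - G(z, theta)) ** 2
    return total / len(zs)


def closed_form_gradient(theta_old, alpha, zs):
    # Sample average of (23): 2 * alpha * dG/dtheta * grad L(G_old(z)).
    return 2.0 * alpha * sum(z * grad_L(G(z, theta_old)) for z in zs) / len(zs)


zs = [-1.0, 0.5, 2.0]
theta_old, alpha, eps = 0.7, 0.1, 1e-6
# Central finite difference of the sub-objective at theta = theta_old.
numeric = (sub_objective(theta_old + eps, theta_old, alpha, zs)
           - sub_objective(theta_old - eps, theta_old, alpha, zs)) / (2 * eps)
print(numeric, closed_form_gradient(theta_old, alpha, zs))
```

The finite-difference gradient and the closed form agree up to rounding, which is the content of the step from (22) to (23) in this simplified case.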


B THE PROOFS FOR THE PROPOSITIONS 2.1 AND 2.2

We will first prove Proposition 2.2. Let us begin with a useful lemma. We will assume that all regularity assumptions made in Section 2 hold for all variables and functions that appear in this section. Also, we will use $Df$ to denote the derivative of $f$ and, unless otherwise noted, $Df(x)$ indicates $D_x f(x)$, the derivative of $f$ with respect to $x$.

Lemma B.1. Let $\mu$ and $\nu$ be probability distributions on $\mathbb{R}^n$, and suppose that we can write $(\mathrm{Id} - \nabla L)_\#\mu = \nu$ with one-to-one $\nabla L$. Let us also write $T = \mathrm{Id} - \nabla L$, and let $T_t = \mathrm{Id} - t\nabla L$ be its time-linear interpolation. Then, for all $t \in [0, 1]$,

$$\frac{d}{dt} H(T_{t\#}\mu) = -E_\mu\left[\mathrm{tr}\left((I - tD^2L(X))^{-1} D^2L(X)\right)\right]. \tag{24}$$

Proof. Let $p_\mu$ be the pdf of $\mu$. If $y = T(x)$, by the one-to-one assumption $\det(DT(x)) > 0$ and we may let $dy = \det(DT(x))dx$. By appealing to the fact that $E_\mu[h(T(X))] = E_\nu[h(Y)]$ for arbitrary $h$,

$$\int h(T(x))p_\mu(x)dx = \int h(y)p_{T_\#\mu}(y)dy = \int h(T(x))p_{T_\#\mu}(T(x))\det(DT(x))dx \tag{25}$$

and we can deduce the identity $p_{T_\#\mu}(T(x))\det(DT(x)) = p_\mu(x)$. Building on this fact, with straightforward computation we can say

$$H(T_{t\#}\mu) = -E_{T_{t\#}\mu}\left[\log p_{T_{t\#}\mu}(X)\right] \tag{26}$$

$$= -\int p_\mu(x)\log p_{T_{t\#}\mu}(T_t(x))dx \tag{27}$$

$$= -\int p_\mu(x)\left(\log p_\mu(x) - \log\det(DT_t(x))\right)dx \tag{28}$$

$$= H(\mu) + E_\mu\left[\log\det(DT_t(X))\right]. \tag{29}$$

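Assuming that the distributions are regular enough that we can swap the derivative and the integral, we can appeal to Jacobi's formula and deduce

$$\frac{d}{dt} H(T_{t\#}\mu) = \frac{d}{dt} E_\mu\left[\log\det(DT_t(X))\right] \tag{30}$$

$$= \frac{d}{dt} E_\mu\left[\log\det(I - tD^2L(X))\right] \tag{31}$$

$$= E_\mu\left[\frac{\det(DT_t(X))\,\mathrm{tr}\left((I - tD^2L(X))^{-1}(-D^2L(X))\right)}{\det(DT_t(X))}\right] \tag{32}$$

$$= -E_\mu\left[\mathrm{tr}\left((I - tD^2L(X))^{-1}D^2L(X)\right)\right]. \tag{33}$$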
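Lemma B.1 can be sanity-checked in a case where every quantity is available in closed form. The sketch below assumes a hypothetical Gaussian $\mu = N(0, \mathrm{diag}(s_k))$ and a hypothetical quadratic potential $L(x) = \sum_k b_k x_k^2/2$, so that $T_t$ is the linear map that scales coordinate $k$ by $1 - tb_k$; these choices are for illustration only and do not come from the paper.

```python
import math


def gaussian_entropy(variances):
    # Differential entropy of N(0, diag(variances)).
    return 0.5 * sum(math.log(2.0 * math.pi * math.e * v) for v in variances)


def pushforward_entropy(variances, hess, t):
    # T_t(x) = x - t * grad L(x) with L(x) = sum_k b_k x_k^2 / 2 scales
    # coordinate k by (1 - t * b_k), so its variance becomes (1 - t*b_k)^2 * s_k.
    return gaussian_entropy([(1.0 - t * b) ** 2 * s for s, b in zip(variances, hess)])


def entropy_rate(hess, t):
    # Right-hand side of (24) for diagonal D^2 L = diag(hess):
    # -tr((I - t D^2 L)^{-1} D^2 L) = -sum_k b_k / (1 - t * b_k).
    return -sum(b / (1.0 - t * b) for b in hess)


variances = [1.0, 0.25]
hess = [-0.5, -2.0]  # negative entries: the illustrative L is concave
t, eps = 0.3, 1e-6
# Central finite difference of t -> H(T_t # mu).
numeric = (pushforward_entropy(variances, hess, t + eps)
           - pushforward_entropy(variances, hess, t - eps)) / (2 * eps)
print(numeric, entropy_rate(hess, t))
```

With a concave potential the rate is positive, so the entropy increases along the interpolation, in line with the statement that concavity of $L$ promotes the entropy of the pushforward.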

Now, let us prove Proposition 2.2. Without loss of generality, choose $L$ so that $DT = D^2\left(\|x\|^2/2 - L\right)$. Assuming that $D^2L$ is diagonalizable and that its spectrum is uniformly bounded away from 1, let $\mathrm{spec}(D^2L) = \{\lambda_k\}$ and write $\mathrm{diag}\{\lambda_k\}_k = \Lambda$. Because $\|x\|^2/2 - L$ is assumed to be strictly convex, $D^2\left(\|x\|^2/2 - L\right) = I - D^2L$ is positive definite, and $1 - \lambda_k(x) > 0$ for all $k$. Appealing to equation (29), we see that the assumption $H(\nu) > H(\mu)$ is equivalent to

$$E_\mu\left[\log\det(DT(X))\right] = E_\mu\left[\log\det(I - D^2L(X))\right] = E_\mu\left[\sum_k \log(1 - \lambda_k(X))\right] > 0. \tag{34}$$

Let us write $\max_k\{1 - t\lambda_k(x)\} = C_t(x) > 0$. Using the assumption that the support of $\mu$ is compact, we can also say $0 < \max_{\mathrm{supp}(\mu)}\{C_t(x)\} := c_t < \infty$. In general, if $A$ is positive definite and diagonalizable, with a straightforward argument we can say

$$\log(\det A) \le \mathrm{tr}(A - I). \tag{35}$$


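Using this fact, we see that

$$0 < E_\mu\left[\log\det(I - D^2L(X))\right] \tag{36}$$

$$\le E_\mu\left[\mathrm{tr}(-D^2L(X))\right]. \tag{37}$$

Suppose that $D^2L$ is diagonalizable with the change of basis matrix $P$. Then

$$E_\mu\left[\mathrm{tr}\left((I - tD^2L(X))^{-1}D^2L(X)\right)\right] = E_\mu\left[\mathrm{tr}\left((P^{-1}(I - t\Lambda)P)^{-1}P^{-1}\Lambda(X)P\right)\right] \tag{39}$$

$$= E_\mu\left[\mathrm{tr}\left(P^{-1}(I - t\Lambda)^{-1}PP^{-1}\Lambda(X)P\right)\right] \tag{40}$$

$$= E_\mu\left[\mathrm{tr}\left((I - t\Lambda(X))^{-1}\Lambda(X)\right)\right] \tag{41}$$

$$= E_\mu\left[\sum_k \frac{\lambda_k(X)}{1 - t\lambda_k(X)}\right]. \tag{42}$$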

This concludes the proof of Proposition 2.2. Proposition 2.1 is much simpler to prove. In order to ensure that $H(\nu) \ge H(\mu)$, we only need to guarantee that $E_\mu[\log\det(DT(X))] \ge 0$. By requiring $L$ to be concave, we make $\|x\|^2/2 - L(x)$ strongly convex, so that the eigenvalues of $DT = I - D^2L$ are all at least 1 and $\log\det(DT)$ is non-negative. The result then follows trivially from an argument similar to the one that leads to equation (34).

C EXPERIMENTAL SETTINGS

C.1 PERFORMANCE MEASURES

For the measure of GAN performance on the image datasets, we used the Inception score (Salimans et al., 2016). The Inception score was originally introduced as an exponentiated divergence measure based on a trained Inception convolutional neural network (Szegedy et al., 2015), which is often called the Inception model. Using $p(y|x)$ to denote the Inception model, the Inception score is given by $I(\{x_n\}_{n=1}^N) := \exp(\hat{E}[D_{KL}[p(y|x)\,\|\,p(y)]])$, where $\hat{E}$ indicates the empirically approximated expectation. Oftentimes, $p(y)$ is approximated with $\frac{1}{N}\sum_{n=1}^N p(y|x_n)$.

The dominating consensus among the machine learning community is that this score is strongly correlated with subjective human judgment of image quality. Following the procedure in Salimans et al. (2016), we generated 5000 examples from each trained generator and calculated the Inception score on the samples. We evaluated the score 10 times with different seeds for the generation of xn and reported the average and the standard deviation of the scores.
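The definition of the Inception score above can be written out as a minimal sketch. It assumes that the class probabilities $p(y|x_n)$ have already been computed and are passed in as a list of rows named `probs` (a hypothetical input; the Inception network itself is not part of the sketch), and it uses the empirical average of the rows as $p(y)$.

```python
import math


def inception_score(probs):
    """exp of the average KL(p(y|x_n) || p(y)), with p(y) estimated as the
    mean of the rows of `probs` (one row of class probabilities per sample)."""
    n, k = len(probs), len(probs[0])
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    kl_sum = 0.0
    for row in probs:
        # Terms with p(y|x) == 0 contribute 0 to the KL divergence.
        kl_sum += sum(p * math.log(p / m) for p, m in zip(row, marginal) if p > 0.0)
    return math.exp(kl_sum / n)


confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]  # two samples, two distinct classes
collapsed = [[1.0, 0.0], [1.0, 0.0]]              # two samples, a single class
print(inception_score(confident_and_diverse))
print(inception_score(collapsed))
```

In this two-class toy example the diverse set attains the maximal score of 2 (the number of classes) and the collapsed set attains the minimal score of 1, which is why the score is sensitive to mode collapse.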

Fréchet inception distance (FID) (Heusel et al., 2017) is another measure for the quality of the generated examples that uses second order information of the final layer of the Inception model. The FID is based on the Fréchet distance (Dowson & Landau, 1982) (not to be confused with the FID itself), which is the 2-Wasserstein distance between two multivariate Gaussian distributions, $p_1$ and $p_2$. The Wasserstein distance between two Gaussian distributions has a closed form, and is given by

$$F(p_1, p_2) = \left\|\mu_{p_1} - \mu_{p_2}\right\|_2^2 + \mathrm{trace}\left(C_{p_1} + C_{p_2} - 2(C_{p_1}C_{p_2})^{1/2}\right), \tag{45}$$

where $\{\mu_{p_1}, C_{p_1}\}$ and $\{\mu_{p_2}, C_{p_2}\}$ are the mean and covariance of samples from $p_1$ and $p_2$, respectively. Now, if $f$ is the output of the final layer of the Inception model before the softmax, the Fréchet inception distance (FID) between two distributions $p_1$ and $p_2$ on the images is the Fréchet distance between $f \circ p_1$ and $f \circ p_2$. We empirically computed the Fréchet inception distance between the true distribution and the generated distribution using 10000 samples from the true distribution and 5000 samples from the generator distribution.
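As a hedged illustration of the closed form (45), the sketch below evaluates the Fréchet distance in the simplified case where both covariance matrices are diagonal, an assumption made here only to keep the example free of matrix square roots. In that case $(C_{p_1}C_{p_2})^{1/2}$ is the diagonal matrix of elementwise square roots. The function name and inputs are hypothetical, and a full FID computation would use the complete covariance matrices of the Inception features.

```python
import math


def frechet_distance_diag(mean1, var1, mean2, var2):
    """Equation (45) specialised to diagonal covariances C = diag(var):
    |m1 - m2|^2 + sum_k (v1_k + v2_k - 2 * sqrt(v1_k * v2_k))."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mean1, mean2))
    cov_term = sum(u + v - 2.0 * math.sqrt(u * v) for u, v in zip(var1, var2))
    return mean_term + cov_term


same = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shifted = frechet_distance_diag([1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [4.0, 1.0])
print(same, shifted)
```

Identical Gaussians are at distance 0, and the second pair differs through both the mean term and the covariance term.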


C.2 GAN'S OBJECTIVE FUNCTION

We applied DC regularization to the training with the following set of objective functions, where $\mathrm{softplus}(x) = \log(1 + \exp(x))$.

GAN-vanilla (Goodfellow et al., 2014)

$$\min_G \max_F\ E_\nu[\mathrm{softplus}(F(X))] + E_{\mu_z}[\mathrm{softplus}(-F(G(Z)))] \tag{46}$$

GAN-variant1 (Goodfellow et al., 2014)

$$\max_F\ E_\nu[\mathrm{softplus}(F(X))] + E_{\mu_z}[\mathrm{softplus}(-F(G(Z)))] \tag{47}$$

$$\min_G\ E_{\mu_z}[-\mathrm{softplus}(F(G(Z)))] \tag{48}$$

GAN-variant2

$$\max_F\ E_\nu[\mathrm{softplus}(F(X))] + E_{\mu_z}[\mathrm{softplus}(-F(G(Z)))] \tag{49}$$

$$\min_G\ E_{\mu_z}[-F(G(Z))] \tag{50}$$

GAN-hinge (Lim & Ye, 2017)

$$\max_F\ E_\nu[\min(0, -1 + F(X))] + E_{\mu_z}[\min(0, -1 - F(G(Z)))] \tag{51}$$

$$\min_G\ E_{\mu_z}[-F(G(Z))] \tag{52}$$

WGAN-GP (Gulrajani et al., 2017)

$$\max_F\ E_\nu[F(X)] - E_{\mu_z}[F(G(Z))] - \lambda E_{\hat\mu}\left[\left(\|\nabla_{\hat X}F(\hat X)\|_2 - 1\right)^2\right] \tag{53}$$

where $\hat\mu$ is the law of $ux + (1 - u)z$, with $(x, z, u) \sim \mu \otimes \nu \otimes U[0, 1]$.

$$\min_G\ E_{\mu_z}[-F(G(Z))] \tag{54}$$

LSGAN (Mao et al., 2017)

$$\min_F\ E_\nu[(F(X) - 1)^2] + E_{\mu_z}[(F(G(Z)) + 1)^2] \tag{55}$$

$$\min_G\ E_{\mu_z}[(F(G(Z)) - 1)^2] \tag{56}$$

Feature Matching (Salimans et al., 2016)

$$\max_F\ E_\nu[\min(0, -1 + F(X))] + E_{\mu_z}[\min(0, -1 - F(G(Z)))] \tag{57}$$

$$\min_G\ \left\|E_\nu[\varphi(X)] - E_{\mu_z}[\varphi(G(Z))]\right\|^2 \tag{58}$$

Here $F$ is a linear function of $\varphi$ (there exists $w$ with $w^T\varphi = F$).
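A few of the objectives listed above can be sketched as plain functions of discriminator outputs. The sketch assumes that `f_real` holds the values $F(X)$ on a batch of data samples and `f_fake` holds $F(G(Z))$ on a batch of generated samples (both hypothetical inputs), and that expectations are replaced by batch means. The hinge terms are written in the standard form $\min(0, -1 + F(X))$ and $\min(0, -1 - F(G(Z)))$ of Lim & Ye (2017), which is an assumption about the signs.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mean(values):
    return sum(values) / len(values)


def hinge_discriminator_objective(f_real, f_fake):
    # Quantity maximised over F in the GAN-hinge objective.
    return (mean([min(0.0, -1.0 + f) for f in f_real])
            + mean([min(0.0, -1.0 - f) for f in f_fake]))


def linear_generator_loss(f_fake):
    # Quantity minimised over G for GAN-variant2, GAN-hinge and WGAN-GP.
    return mean([-f for f in f_fake])


def variant1_generator_loss(f_fake):
    # Quantity minimised over G for GAN-variant1.
    return mean([-softplus(f) for f in f_fake])


def lsgan_discriminator_loss(f_real, f_fake):
    # Quantity minimised over F for LSGAN.
    return mean([(f - 1.0) ** 2 for f in f_real]) + mean([(f + 1.0) ** 2 for f in f_fake])


def lsgan_generator_loss(f_fake):
    # Quantity minimised over G for LSGAN.
    return mean([(f - 1.0) ** 2 for f in f_fake])


f_real, f_fake = [2.0, 0.5], [-2.0, 0.0]
print(hinge_discriminator_objective(f_real, f_fake))
print(linear_generator_loss(f_fake), lsgan_generator_loss(f_fake))
```

In an actual training loop these quantities would be computed on framework tensors so that gradients can flow to the networks; plain Python lists are used here only to make the arithmetic explicit.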

C.3 NETWORK ARCHITECTURES

Table 2: The architecture of DCGAN (Radford et al., 2016) for image generation experiments on CIFAR-10 and CIFAR-100. The slopes of all leaky-ReLU (lReLU) functions in the networks were set to 0.2.

| (a) Generator, Mg = 4 for CIFAR-10 and CIFAR-100 |
|---|
| $z \in \mathbb{R}^{128} \sim N(0, I)$ |
| dense → Mg × Mg × 512 |
| 4×4, stride=2 deconv. BN 256 ReLU |
| 4×4, stride=2 deconv. BN 128 ReLU |
| 4×4, stride=2 deconv. BN 64 ReLU |
| 3×3, stride=1 conv. 3 Tanh |

| (b) Discriminator, M = 32 for CIFAR-10 and CIFAR-100 |
|---|
| RGB image $x \in \mathbb{R}^{M \times M \times 3}$ |
| 3×3, stride=1 conv 64 lReLU |
| 4×4, stride=2 conv 64 lReLU |
| 3×3, stride=1 conv 128 lReLU |
| 4×4, stride=2 conv 128 lReLU |
| 3×3, stride=1 conv 256 lReLU |
| 4×4, stride=2 conv 256 lReLU |
| 3×3, stride=1 conv. 512 lReLU |


Fig 6: ResBlock architectures for CIFAR-10 and CIFAR-100. We used architectures similar to the ones used in Gulrajani et al. (2017).

Table 3: ResNet architectures for CIFAR-10 and CIFAR-100. We used architectures similar to the ones used in Gulrajani et al. (2017). The ResBlock model is shown in Fig 6.

| (a) Generator |
|---|
| $z \in \mathbb{R}^{128} \sim N(0, I)$ |
| dense, 4 × 4 × 128 |
| ResBlock up 128 |
| ResBlock up 128 |
| ResBlock up 128 |
| BN, ReLU, 3×3 conv, 3 Tanh |

| (b) Discriminator |
|---|
| RGB image $x \in \mathbb{R}^{32 \times 32 \times 3}$ |
| ResBlock down 128 |
| ResBlock down 128 |
| ResBlock 128 |
| ResBlock 128 |
| Global sum pooling |

Table 4: ResNet generator architecture (large version) for CIFAR-10 and CIFAR-100. The ResBlock model is shown in Fig 6.

| (a) Generator |
|---|
| $z \in \mathbb{R}^{128} \sim N(0, I)$ |
| dense, 4 × 4 × 128 |
| ResBlock up 256 |
| ResBlock up 256 |
| ResBlock up 256 |
| BN, ReLU, 3×3 conv, 3 Tanh |


Table 5: ResNet architectures for image generation on the ImageNet dataset. The ResBlock model is shown in Fig 6.

| (a) Generator |
|---|
| $z \in \mathbb{R}^{128} \sim N(0, I)$ |
| dense, 4 × 4 × 1024 |
| ResBlock up 1024 |
| ResBlock up 512 |
| ResBlock up 256 |
| ResBlock up 128 |
| ResBlock up 64 |
| BN, ReLU, 3×3 conv 3 |

| (b) Discriminator for unconditional GANs |
|---|
| RGB image $x \in \mathbb{R}^{64 \times 64 \times 3}$ |
| ResBlock down 64 |
| ResBlock down 128 |
| ResBlock down 256 |
| ResBlock down 512 |
| ResBlock down 1024 |
| ResBlock 1024 |
| Global sum pooling |

Table 6: The decoder models of EGAN (Dai et al., 2017) used in our experiments on CIFAR-10 and CIFAR-100. The slopes of all lReLU functions in the networks were set to 0.2. Fig 6 illustrates our ResBlock model.

| (a) Convolution decoder, M = 32 for CIFAR-10 and CIFAR-100 |
|---|
| RGB image $x \in \mathbb{R}^{M \times M \times 3}$ |
| 3×3, stride=1 conv 64 lReLU |
| 4×4, stride=2 conv 64 lReLU |
| 3×3, stride=1 conv 128 lReLU |
| 4×4, stride=2 conv 128 lReLU |
| 3×3, stride=1 conv 256 lReLU |
| 4×4, stride=2 conv 256 lReLU |
| 3×3, stride=1 conv. 512 lReLU |
| dense → 128 * 2 |

| (b) ResNet decoder |
|---|
| RGB image $x \in \mathbb{R}^{32 \times 32 \times 3}$ |
| ResBlock down 64 |
| ResBlock down 128 |
| ResBlock down 256 |
| ResBlock down 512 |
| ResBlock down 1024 |
| ResBlock 1024 |
| Global sum pooling |
| dense → 128 * 2 |


C.4 DETAILS FOR THE EXPERIMENTS ON CIFAR-10 AND CIFAR-100

Experimental setting for the evaluation of the method's robustness against the choice of objective functions For the model, we chose the architectures of DCGAN (Radford et al., 2016) (Table 2) and ResNet (Table 3) with spectral normalization applied to the fully connected layers and the convolution layers (SNDCGAN, SNResNet). For the optimization, we used Adam (Kingma & Ba, 2015) and chose (α = 0.0002, β1 = 0, β2 = 0.9) for the hyperparameters. Also, we chose (ndis = 1, ngen = 1) for SNDCGAN and (ndis = 5, ngen = 1) for SNResNet. We updated the generator 100k times and linearly decayed the learning rate over the last 5k iterations.

Experimental setting for the evaluation of the method's robustness against the choice of the prior dimension We chose SNDCGAN for the model, and optimized the network with Adam using the hyperparameters (α = 0.0002, β1 = 0, β2 = 0.9). For the update of the discriminator and the generator, we set ndis = 1, ngen = 1. We updated the generator 50k times and linearly decayed the learning rate over the last 5k iterations. For the objective function, we chose GAN-variant2 in Appendix C.2. We also set λ = 3.0, d = 0.01 for the parameters in equation (13). For this set of experiments, we repeated each experiment three times with different seeds, and reported the maximum, mean and minimum. We tested the prior dimensions dim(z) = 3, 4, 5, 7, 10, 500, 1000, 2000, 4000, 6000, 8000, 10000.

Comparison with EGAN and other methods For the model, we chose SNDCGAN, SNResNet and SNResNet Large, and trained the networks using Adam with hyperparameters (α = 0.0002, β1 = 0, β2 = 0.9). SNResNet Large is a model that was used in (Miyato et al., 2018). It is the same model as SNResNet except that it uses a larger generator (Table 4). For these models, we updated the generator 100k times and linearly decayed the learning rate over the last 10k iterations. For EGAN-Ent-VI (Dai et al., 2017), we used an additional decoder equipped with convolution layers. We used ndis = 1, ngen = 1, ndec = 5 for SNDCGAN, and ndis = 1, ngen = 5, ndec = 5 for SNResNet. The table below lists the choices of the hyperparameters (λ, d) in equation (13) that we used in our comparative study, sorted by model.

Table 7: List of the hyperparameter choices.

| Method | objective | ndis | λ | d |
|---|---|---|---|---|
| SNDCGAN + DC reg (CIFAR-10) | GAN-variant2 | 1 | 3 | 0 |
| SNDCGAN + DC reg (CIFAR-100) | GAN-variant2 | 1 | 3 | 0 |
| SNDCGAN-hinge + DC reg (CIFAR-10) | GAN-hinge | 1 | 3 | 0 |
| SNDCGAN-hinge + DC reg (CIFAR-100) | GAN-hinge | 1 | 3 | 0 |
| SNResNet + DC reg (CIFAR-10) | GAN-variant2 | 3 | 3 | 0 |
| SNResNet + DC reg (CIFAR-100) | GAN-variant2 | 5 | 3 | 0 |
| SNResNet Large + DC reg (CIFAR-10) | GAN-variant2 | 5 | 3 | 0 |
| SNResNet Large + DC reg (CIFAR-100) | GAN-variant2 | 5 | 4 | 0 |
| SNResNet Large-hinge + DC reg (CIFAR-10) | GAN-hinge | 5 | 4 | 0.01 |
| SNResNet Large-hinge + DC reg (CIFAR-100) | GAN-hinge | 5 | 4 | 0.01 |

C.5 IMAGE GENERATION ON IMAGENET

The images used in this set of experiments were resized to 64 × 64 pixels. The details of the architecture are given in Table 5. For the optimization, we used Adam with the same hyperparameters we used for ResNet on the CIFAR-10 and CIFAR-100 datasets. We trained the networks with 250K generator updates, and applied linear decay to the learning rate after 200K iterations so that the rate would be 0 at the end. We set λ = 6.0, d = 0.01 for the parameters in equation (13).
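The learning rate schedule described above can be sketched as a small function. The default values follow the numbers stated in this appendix (α = 0.0002, 250K generator updates, decay starting at 200K); the function name and signature are hypothetical, and in practice the schedule would be attached to the Adam optimizer of the training framework in use.

```python
def linear_decay_lr(iteration, base_lr=0.0002, decay_start=200_000, total=250_000):
    """Learning rate held at base_lr until decay_start, then decreased
    linearly so that it reaches 0 at iteration `total`."""
    if iteration <= decay_start:
        return base_lr
    remaining = max(total - iteration, 0)
    return base_lr * remaining / (total - decay_start)


for it in (0, 200_000, 225_000, 250_000):
    print(it, linear_decay_lr(it))
```

The same function covers the CIFAR settings of Appendix C.4 by changing the arguments, for example `decay_start=95_000, total=100_000` for a decay over the last 5k of 100k iterations.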


D APPENDIX RESULTS

D.1 APPENDIX RESULT ON ARTIFICIAL DATA

[Figure 7: scatter plots of generator samples. Panels: (a) without DC regularization, (b) with DC regularization.]

Fig 7: Generator samples on GMM fitting by GANs (a) without and (b) with DC regularization.

D.2 APPENDIX RESULT ON CIFAR-10 AND CIFAR-100

Experiment with different architectures

[Figure 8: bar charts of FID for SNDCGAN, SNDCGAN+DC-reg, SNResNet and SNResNet+DC-reg under the objectives GAN-vanilla, GAN-variant1, GAN-variant2, GAN-hinge, WGAN-GP and LSGAN. Panels: (a) CIFAR-10, (b) CIFAR-100.]

Fig 8: FIDs for unsupervised image generation on CIFAR-10 and CIFAR-100 (lower is better).

Experiment with different architectures on WGAN-GP

[Figure 9: bar charts comparing WDCGAN-GP, WDCGAN-GP + DC-reg, WResNet GAN-GP and WResNet GAN-GP + DC-reg on CIFAR-10 and CIFAR-100. Panels: (a) Inception score (higher is better), (b) FID (lower is better).]

Fig 9: Results of WGAN-GP.

Experiment with varying prior dimension


[Figure 10: FID plotted against the dimension of Z for the baseline and the proposal. Panels: (a) low prior dimension, (b) high prior dimension.]

Fig 10: FID performance of DC regularization on low dimensional priors and high dimensional priors.

Comparison with other methods

Table 8: Inception scores and FIDs for unsupervised image generation on CIFAR-10 and CIFAR-100. The CIFAR-10 results for the models designated with † are cited from (Miyato et al., 2018), and the CIFAR-10 results with ‡ are cited from (Karras et al., 2018). For the details of the objective functions, see Appendix C.2.

| Method | Inception score (CIFAR-10) | Inception score (CIFAR-100) | FID (CIFAR-10) | FID (CIFAR-100) |
|---|---|---|---|---|
| Real data | 11.24 | 14.79 | 7.6 | 8.94 |
| EGAN-Ent-VI (SNDCGAN) | 6.95 ± .08 | 6.62 ± .10 | 29.0 | 33.3 |
| EGAN-Ent-VI (SNResNet) | 7.31 ± .12 | 6.67 ± .10 | 27.0 | 30.5 |
| feature matching (SNDCGAN) | 7.54 ± .10 | 7.71 ± .06 | 25.9 | 29.2 |
| proposal | | | | |
| SNDCGAN + DC reg | 8.08 ± .12 | 8.12 ± .11 | 24.6 | 25.8 |
| SNDCGAN-hinge + DC reg | 7.70 ± .11 | 7.99 ± .09 | 24.7 | 26.1 |
| SNResNet + DC reg | 8.27 ± .08 | 8.27 ± .13 | 24.3 | 24.6 |
| SNResNet Large + DC reg | 8.41 ± .10 | 8.20 ± .08 | 20.6 | 24.8 |
| SNResNet Large-hinge + DC reg | 8.29 ± .09 | 8.41 ± .11 | 19.5 | 23.6 |
| baseline | | | | |
| SNDCGAN† | 7.42 ± .08 | 7.74 ± .08 | 29.3 | 27.9 |
| SNDCGAN-hinge† | 7.58 ± .12 | 7.57 ± .07 | 25.5 | 28.1 |
| SNResNet Large-hinge† | 8.22 ± .05 | 7.54 ± .13 | 21.7 | 26.6 |
| Progressive GANs‡ | 8.56 ± .06 | | | |


Image generation on CIFAR-10 and CIFAR-100

[Figure 11: grids of generated images. Panels: (a) image generation of CIFAR-10, (b) image generation of CIFAR-100.]

Fig 11: Image generation of CIFAR-10 and CIFAR-100.


D.3 APPENDIX RESULT ON IMAGENET

[Figure 12: grids of generated images. Panels: (a) image generation of the baseline model, (b) image generation of the proposal model.]

Fig 12: Image generation of ImageNet.