On Relativistic f-Divergences

Alexia Jolicoeur-Martineau
Mila, Université de Montréal. Correspondence to: Alexia Jolicoeur-Martineau.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We take a more rigorous look at Relativistic Generative Adversarial Networks (RGANs) and prove that the objective function of the discriminator is a statistical divergence for any concave function f with minimal properties (f(0) = 0, f'(0) ≠ 0, sup_x f(x) > 0). We devise additional variants of relativistic f-divergences. We show that the Wasserstein distance is weaker than f-divergences, which are weaker than relativistic f-divergences. Given the good performance of RGANs, this suggests that Wasserstein GAN does not perform well primarily because of its weak metric, but rather because of its regularization and its use of a relativistic discriminator. We introduce the minimum-variance unbiased estimator (MVUE) for Relativistic GANs and show that it does not perform better. We show that the estimator of Relativistic average GANs (Ra GANs) is asymptotically unbiased and that the finite-sample bias is small; removing this bias does not improve performance.

1. Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are a very popular approach to approximately generate data from a complex probability distribution using only samples of data (without any information on the true data distribution). Most notably, they have been very successful at generating photo-realistic images (Karras et al., 2017; 2018). A GAN consists of a game between two neural networks, the generator G and the discriminator D. The goal of D is to classify real from fake (generated) data. The goal of G is to generate fake data that appears to be real, thus fooling D into thinking that fake data is actually real.

There are many GAN variants, and most of them consist of changing the loss function of D. To name a few: Standard GAN (SGAN) (Goodfellow et al., 2014), Least Squares GAN (LSGAN) (Mao et al., 2017), Hinge-loss GAN (Hinge GAN) (Miyato et al., 2018), and Wasserstein GAN (WGAN) (Arjovsky et al., 2017). For most GAN variants, training D is equivalent to estimating a divergence: SGAN estimates the Jensen-Shannon divergence (JSD), LSGAN estimates the Pearson χ² divergence, Hinge GAN estimates the reverse KL divergence, and WGAN estimates the Wasserstein distance. Even more generally, f-GANs (Nowozin et al., 2016) estimate any f-divergence (which includes most of the popular divergences), while IPM-based GANs (Mroueh & Sercu, 2017) estimate any integral probability metric (IPM) (Müller, 1997). Thus, intuitively, GANs can be thought of as estimating a divergence and then minimizing it (this is not technically correct; see Jolicoeur-Martineau (2018b)).

Recently, Jolicoeur-Martineau (2018a) showed that IPM-based GANs possess a unique type of discriminator which they call a Relativistic Discriminator (RD). They explained that one can construct f-GANs while using an RD and that doing so improves the stability of training and the quality of generated data. They called this approach Relativistic GANs (RGANs). They proposed two variants: Relativistic paired GANs (Rp GANs)¹ and Relativistic average GANs (Ra GANs).

Jolicoeur-Martineau (2018a) provided mathematical and intuitive arguments as to why using a Relativistic Discriminator (RD) may be helpful. However, they did not prove that the loss functions are mathematically sensible. Furthermore, the estimators that they used are not the minimum-variance unbiased estimators (MVUE).

The contributions of this paper are the following:

1. We prove that the objective functions of the discriminator in RGANs are divergences (relativistic f-divergences).
2. We devise additional variants of relativistic f-divergences.
3. We show that the Wasserstein distance is weaker than f-divergences, which are weaker than relativistic f-divergences.
4. We present the minimum-variance unbiased estimator (MVUE) of Rp GANs and show that using it hinders the performance of the generator.
5. We show that Ra GANs are only asymptotically unbiased, but that the finite-sample bias is small. Removing this bias does not improve the performance of the generator.

¹We added the word "paired" to better distinguish the variant with paired real/fake data (originally called RGANs) from the general approach called Relativistic GANs (RGANs).

2. Background

For the rest of the paper, we will refer to the critic C(x) instead of the discriminator D(x). The critic is the discriminator before applying the activation function (D(x) = a(C(x)), where a is an activation function and C(x) ∈ R). Intuitively, the critic can be thought of as describing how realistic x is. In the case of SGAN and Hinge GAN, a large C(x) means that x is realistic, while a small C(x) means that x is not realistic. We use this notation because Relativistic GANs are defined in terms of the critic rather than the discriminator.

2.1. Generative Adversarial Networks

GANs can be defined very generally in the following way:

sup_{C : X → R} E_{x~P}[f1(C(x))] + E_{y~Q}[f2(C(y))],   (1)

sup_{G : Z → X} E_{x~P}[g1(C(x))] + E_{z~Z}[g2(C(G(z)))],   (2)

where f1, f2, g1, g2 : R → R, P is the distribution of real data with support X, Z is the latent distribution (generally a multivariate normal distribution), C(x) is the critic evaluated at x, G(z) is the generator evaluated at z, and G(z) ~ Q, where Q is the distribution of fake data. See Brock et al. (2018) for details on how different choices of Z perform. The critic and the generator are generally trained with stochastic gradient descent (SGD) in alternating steps.

Most GANs can be separated into two classes: non-saturating and saturating loss functions.
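As a concrete illustration (ours, not from the paper), equations (1) and (2) can be instantiated with SGAN's choices f1(z) = log(sigmoid(z)) and f2(z) = log(1 - sigmoid(z)). The sketch below evaluates the critic objective of equation (1) on hypothetical mini-batch critic scores:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# SGAN's choices for equation (1):
f1 = lambda z: np.log(sigmoid(z))          # applied to real scores
f2 = lambda z: np.log(1.0 - sigmoid(z))    # applied to fake scores

# Hypothetical critic scores on a mini-batch (real scores high, fake low).
c_real = np.array([2.0, 1.5, 1.0])   # C(x) on real samples
c_fake = np.array([-1.0, -2.0, 0.0]) # C(y) on fake samples

# Critic objective (equation 1): E_P[f1(C(x))] + E_Q[f2(C(y))].
critic_obj = f1(c_real).mean() + f2(c_fake).mean()

# A constant critic C = 0 (no discrimination at all) yields -log 4,
# the value at the SGAN equilibrium.
baseline = f1(np.zeros(3)).mean() + f2(np.zeros(3)).mean()
```

A critic that separates real from fake scores, as above, attains a higher objective than the constant critic (`critic_obj > baseline`).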
GANs with the saturating loss are such that g1 = -f1 and g2 = -f2, while GANs with the non-saturating loss are such that g1 = f2 and g2 = f1. In this paper, we will assume that the non-saturating loss is used, as it generally works best in practice (Goodfellow et al., 2014; Nowozin et al., 2016). Note that g1 generally has no impact on training since its gradient with respect to G is zero; we can thus ignore it.

Although not always the case, the most popular GAN loss functions (SGAN, LSGAN with labels -1/1, Hinge GAN, WGAN) are symmetric (i.e., f2(x) = f1(-x)). For simplicity, in this paper, we restrict ourselves to symmetric loss functions. Non-saturating symmetric GANs (Sy GANs) can be represented more simply as:

sup_{C : X → R} E_{x~P}[f(C(x))] + E_{y~Q}[f(-C(y))],   (3)

sup_{G : Z → X} E_{z~Z}[f(C(G(z)))],   (4)

for some function f : R → R.

For easier optimization, we generally want f to be concave with respect to the critic. This is the case in symmetric f-GANs. In this paper, we restrict our relativistic divergences to symmetric cases with concave f. Although this may be somewhat constraining, not making these assumptions would be very problematic for GANs. By not assuming concavity, we could have an objective function that diverges to infinity (and thus an infinite divergence). This is particularly problematic for GANs because, early in training, we expect P and Q to be perfectly separated (because of fully disjoint supports). This would cause the objective function to explode towards infinity, thereby causing severe instabilities. The Kullback-Leibler (KL) divergence is a good example of such a problematic divergence for GANs: if a single sample from the support of Q is not part of the support of P, the divergence will be ∞. Also, note that the dual form of the KL divergence cannot be represented as a Sy GAN with equation (3) since f1(x) = x and f2(x) = -e^(x-1) are not symmetric (Nowozin et al., 2016).

2.2.
Integral Probability Metrics

Rather than using a concave function f to ensure a maximum on the objective function, IPM-based GANs instead force the critic to respect some constraint so that it does not grow too quickly. IPM-based GANs are defined in the following way:

sup_{C : X → R, C ∈ F} E_{x~P}[C(x)] - E_{y~Q}[C(y)],   (5)

sup_{G : Z → X} E_{z~Z}[C(G(z))],   (6)

where F is a class of functions chosen such that the IPM is not infinite. See Mroueh et al. (2017) for an extensive review of the choices of F.

2.3. Relativistic GANs

Rather than training the critic on real and fake data separately, Relativistic GANs try to maximize the critic's difference (CD). In Relativistic paired GANs (Rp GANs), the CD is defined as C(x) - C(y), while in Relativistic average GANs (Ra GANs), the CD is defined as C(x) - E_{y~Q}C(y) (or vice-versa). The CD can be understood as how much more realistic real data is than fake data. The optimal size of the CD is determined by the choice of f. With a least-squares loss, the CD must be exactly equal to 1. On the other hand, with a log-sigmoid loss, the CD grows to around 2 or 3 (after which the gradient of f vanishes to zero). This will be explained in more detail in the next section. Again, we focus only on choices of f that have symmetry (as done with Sy GANs).

Relativistic paired GANs (Rp GANs) are defined in the following way:

sup_{C : X → R} E_{x~P, y~Q}[f(C(x) - C(y))],   (7)

sup_{G : Z → X} E_{x~P, z~Z}[f(C(G(z)) - C(x))].   (8)

Relativistic average GANs (Ra GANs) are defined in the following way:

sup_{C : X → R} E_{x~P}[f(C(x) - E_{y~Q}C(y))] + E_{y~Q}[f(E_{x~P}C(x) - C(y))],   (9)

sup_{G : Z → X} E_{z~Z}[f(C(G(z)) - E_{x~P}C(x))] + E_{x~P}[f(E_{z~Z}C(G(z)) - C(x))].   (10)

3. Relativistic Divergences

We define statistical divergences in the following way:

Definition 3.1. Let P and Q be probability distributions and S be the set of all probability distributions with common support. A function D : S × S → R≥0 is a divergence if it respects the following two conditions:

D(P, Q) ≥ 0 for all P, Q in S;
D(P, Q) = 0 ⟺ P = Q.
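The paired and average critic objectives of equations (7) and (9) are easy to compute from mini-batch critic scores. A minimal sketch (illustrative scores are ours), using the least-squares choice of f so that the optimal CD is 1:

```python
import numpy as np

# Least-squares choice of f: maximized when the critic's difference equals 1.
f_ls = lambda z: -(z - 1.0) ** 2 + 1.0

# Hypothetical critic scores on a mini-batch of real and fake samples.
c_real = np.array([0.9, 1.1, 1.0])
c_fake = np.array([-0.2, 0.1, 0.0])

# Rp GAN critic objective (equation 7): f of the paired critic's difference.
rp_obj = f_ls(c_real - c_fake).mean()

# Ra GAN critic objective (equation 9): each score is compared against the
# mean critic score of the opposite class.
ra_obj = (f_ls(c_real - c_fake.mean()).mean()
          + f_ls(c_real.mean() - c_fake).mean())

# If the critic cannot tell real from fake (identical scores), every CD is 0
# and f(0) = 0, so the objective vanishes.
rp_zero = f_ls(c_real - c_real).mean()
```

Here every paired CD is close to the optimum of 1, so `rp_obj` is near its maximum of 1 and `ra_obj` near its maximum of 2, while `rp_zero` is exactly 0.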
In other words, divergences are distances between probability distributions. The distribution of real data (P) is fixed, and our goal is to modify the distribution of fake data (Q) so that the divergence decreases over time through the training process. It is important to show that we use a divergence; this ensures that it is not possible to obtain a critic which cannot distinguish real from fake samples (D(P, Q) = 0) when the two distributions (real and fake) are not the same (P ≠ Q). If we did not have a divergence, it could be possible to reach a situation where the generator cannot learn (since the critic returns the same value for real and fake samples) while the generator still is not generating samples from the real distribution.

3.1. Main Theorem

As discussed in the introduction, in most GANs, the objective function of the critic at optimum is a divergence. We show that the objective functions of the critic in Rp GANs, Ra GANs, and other variants also estimate divergences. The theorem is as follows:

Theorem 3.1. Let f : R → R be a concave function such that f(0) = 0, f is differentiable at 0, f'(0) ≠ 0, sup_x f(x) > 0, and arg sup_x f(x) > 0. Let P and Q be probability distributions with support X. Let M = ½P + ½Q. Then, we have that

D_f^Rp(P, Q) = sup_{C : X → R} 2 E_{x~P, y~Q}[f(C(x) - C(y))],

D_f^Ra(P, Q) = sup_{C : X → R} E_{x~P}[f(C(x) - E_{y~Q}C(y))] + E_{y~Q}[f(E_{x~P}C(x) - C(y))],

D_f^Ralf(P, Q) = sup_{C : X → R} 2 E_{x~P}[f(C(x) - E_{y~Q}C(y))],

D_f^Rc(P, Q) = sup_{C : X → R} E_{x~P}[f(C(x) - E_{m~M}C(m))] + E_{y~Q}[f(E_{m~M}C(m) - C(y))]

are divergences.

We ask that the supremum of f(x) be reached at some positive x (or at ∞). This is purely to ensure that a larger CD can be interpreted as leading to a larger divergence (rather than the opposite). This does not reduce the generality of Theorem 3.1: if f(x) is maximized at some x < 0, then g(x) = f(-x) is maximized at x > 0 and one can simply use g instead of f. We also require that f be differentiable at zero and that its derivative there be non-zero.
This assumption may not be necessary, but it is needed for one of our main lemmas, which we use to prove that these objective functions are divergences.

Note that D_f^Rp(P, Q) corresponds to Rp GANs, D_f^Ra(P, Q) corresponds to Ra GANs, D_f^Ralf(P, Q) corresponds to a simplified one-way version of Ra GANs (Ralf GANs), and D_f^Rc(P, Q) corresponds to a new type of RGAN called Relativistic centered GANs (Rc GANs). Ralf GANs are not particularly interesting, as they simply represent a simpler version of Ra GANs. On the other hand, Rc GANs are interesting, as they center the critic scores using the mean of the whole mini-batch (rather than the mean of only the real or only the fake mini-batch samples). This divergence also has similarities to the Jensen-Shannon divergence (JSD), since the JSD is the sum of the KL divergence between P and M and the KL divergence between Q and M.

A logical extension to Rc GANs would be to standardize the critic scores; however, this would not lead to a divergence, given that we could not control the size of the elements inside f. To make it a divergence, we would need a learnable scaling weight (as in batch norm (Ioffe & Szegedy, 2015)), but this would counter the effect of the standardization. Thus, standardizing and scaling would just correspond to an equivalent re-parametrization of D_f^Rc.

A sketch of the proof can be found below; the full proof is in Appendix A.

3.2. Sketch of the Proof

Although the four divergences need separate proofs, a similar framework is used in each of them. Each proof consists of three steps. For clarity of notation, let D_f(P, Q) = sup_{C : X → R} F(P, Q, C, f) be the divergence, where F is any of the objective functions in Theorem 3.1.

First, we show that D_f(P, Q) ≥ 0. This is easily proven by taking the simplest possible choice of critic, one which does not depend on the probability distributions, i.e., C_w(x) = k for all x.
This critic always leads to f(0) and thus to an objective function equal to 0. This means that

D_f(P, Q) = sup_{C : X → R} F(P, Q, C, f) ≥ F(P, Q, C_w, f) = 0.

Second, we show that P = Q ⟹ D_f(P, Q) = 0. This step generally relies on Jensen's inequality (for concave functions), which we use to show that D_f(P, P) ≤ 0. Given that D_f(P, P) ≥ 0 and D_f(P, P) ≤ 0, we have that D_f(P, P) = 0.

Third, we show that D_f(P, Q) = 0 ⟹ P = Q. This step is by far the most difficult to prove. Instead of showing it directly, we prove it by contraposition, i.e., we show that P ≠ Q ⟹ D_f(P, Q) > 0. To prove this, we use the fact that if P ≠ Q, there must be values of the probability density functions, p(x) and q(x) respectively, such that p(x) > q(x) (and vice versa). Let T = arg sup_S P(S) - Q(S); we know that this set is not empty. Note that when P and Q have probability density functions p(x) and q(x) respectively, we have that T = {x | p(x) > q(x)}. To make the proof as simple as possible, we use the following sub-optimal critic:

C_α(x) = α if x ∈ T, and C_α(x) = 0 otherwise,

where α ≠ 0. This critic function is very simple, but, as we will show, there exists an α > 0 such that it leads to an objective function greater than 0, which means that the divergence is also greater than 0. With this critic in mind, our goal is to transform the problem into the following:

D_f(P, Q) = sup_{C : X → R} F(P, Q, C, f) ≥ F(P, Q, C_α, f) ≥ L(α),

where L(α) = a f(α) + b f(-α), for some a > 0 and b > 0 such that a > b. We have been able to show this for all four divergences.

We want to find an α > 0 large enough so that the positive term (f(α)) is big, but small enough so that the negative term (f(-α)) is not too big. The main caveat is that, by concavity, f(α) ≤ |f(-α)|. This means that the negative term is always at least as big in absolute value as the positive term. This is problematic, since a could be very close to b, and we need a f(α) > -b f(-α) to get L(α) > 0, which proves that we have a divergence. The solution is to choose α to be very small.
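The small-α argument above can be checked numerically. In the sketch below (our illustration; the masses a = 0.6 and b = 0.4 are hypothetical), f is the log-sigmoid choice: for a large α the negative term dominates and L(α) < 0, but for a small α the a > b gap wins and L(α) > 0:

```python
import math

def f_s(z):
    # Log-sigmoid choice of f: concave, f(0) = 0, f'(0) = 1/2 != 0.
    return math.log(1.0 / (1.0 + math.exp(-z))) + math.log(2.0)

a, b = 0.6, 0.4  # hypothetical masses with a > b, as in the proof sketch
L = lambda alpha: a * f_s(alpha) + b * f_s(-alpha)

# Large CD: |f(-alpha)| >> f(alpha), so the negative term dominates.
large = L(10.0)   # negative

# Small CD: f(alpha) ~ -f(-alpha), so L(alpha) ~ (a - b) f(alpha) > 0.
small = L(0.1)    # positive
```

This mirrors the proof: the divergence is certified to be strictly positive by a critic with a small CD, not a large one.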
By continuity of the concave function, if we make α small enough (very close to 0), we can reach a point where f(α) ≈ -f(-α). In that case, writing a = b + ε, we have that

L(α) = a f(α) + b f(-α) ≈ a f(α) - b f(α) = b f(α) + ε f(α) - b f(α) = ε f(α) > 0.

In the actual proof, we show that there always exists a δ > 0 small enough such that any α ∈ (0, δ) leads to L(α) > 0. This concludes the sketch of the proof.

3.3. Subtypes of Divergences

Figure 1 shows three examples of concave f with the necessary properties to be used in relativistic divergences; they are the concave functions used in SGAN, LSGAN (with labels 1/-1), and Hinge GAN. Their respective mathematical functions are

f_S(z) = log(sigmoid(z)) + log(2),   (11)

f_LS(z) = -(z - 1)² + 1,   (12)

f_Hinge(z) = -max(0, 1 - z) + 1.   (13)

Figure 1. Plots of f with respect to the critic's difference (CD) using three appropriate choices of f for relativistic divergences (panels: SGAN (log-sigmoid), LSGAN (least squares), Hinge GAN (hinge)). The bottom gray line represents f(0) = 0; the divergence is zero if all CDs are zero. The upper gray line represents the maximum of f; the divergence is maximized if all CDs lead to that maximum.

Interestingly, we see that they form three different types of functions. Firstly, we have functions that grow exponentially less as x increases and thus reach their supremum at ∞. Secondly, we have functions that grow to a maximum and then forever decrease (thus penalizing large CDs). Thirdly, we have functions that grow to a maximum and then never change. SGAN is of the first type, LSGAN is of the second type, and Hinge GAN is of the third type. This shows that for all three types, the CD is only encouraged to grow until a certain point. With the first type, we never truly force the CD to stop growing, but the gradients vanish to zero. Thus, SGD effectively prevents the CDs from growing above a certain level (the sigmoid saturates at around 2 or 3).
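The three choices of f from equations (11)-(13), and the properties required by Theorem 3.1 (f(0) = 0 and a positive supremum, here log 2, 1, and 1 respectively), can be verified directly:

```python
import numpy as np

# The three concave choices of f from equations (11)-(13).
f_S     = lambda z: np.log(1.0 / (1.0 + np.exp(-z))) + np.log(2)  # sup = log 2 (at +inf)
f_LS    = lambda z: -(z - 1.0) ** 2 + 1.0                         # sup = 1 (at z = 1)
f_Hinge = lambda z: -np.maximum(0.0, 1.0 - z) + 1.0               # sup = 1 (for z >= 1)

z = np.linspace(-5, 5, 1001)
for f, sup in [(f_S, np.log(2)), (f_LS, 1.0), (f_Hinge, 1.0)]:
    assert abs(f(0.0)) < 1e-9           # f(0) = 0
    assert f(z).max() <= sup + 1e-9     # supremum as stated in the text
    assert sup > 0                      # sup_x f(x) > 0
```

On this grid, f_LS and f_Hinge reach their maximum of 1 exactly, while f_S only approaches log 2, matching the three "subtypes" described above.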
It is useful to keep in mind that Figure 1 also represents the concave functions used for Sy GANs, in which case f applies to real and fake data separately (f(C(x)) and f(-C(y))).

3.4. Weakness of the Divergence

The paper by Arjovsky et al. (2017) on using the Wasserstein distance (and other IPMs) for GANs has been extremely influential. In this paper, the authors suggest that the Wasserstein distance is more appropriate than f-divergences for training a critic since it induces the weakest topology possible. Rather than giving a formal definition in terms of topologies, we use a simpler definition (as also done by Arjovsky et al. (2017)):

Definition 3.2. Let P be a probability distribution with support X, (Pn)_{n∈N} be a sequence of distributions converging to P, and D1 and D2 be statistical divergences (per Definition 3.1). We say that D1 is weaker than D2 if we have that

D2(Pn, P) → 0 ⟹ D1(Pn, P) → 0 for all (Pn)_{n∈N},

but the converse is not true. We say that D1 is a weakest distance if we have that

D1(Pn, P) → 0 ⟺ Pn →_D P for all (Pn)_{n∈N},

where →_D represents convergence in distribution.

Thus, intuitively, a weaker divergence can be thought of as converging more easily. Arjovsky et al. (2017) showed that the Wasserstein distance is a weakest divergence and that it is weaker than common f-divergences (as used in f-GANs and standard GANs). They also showed that the Wasserstein distance is continuous with respect to its parameters, and they attributed this property to the weakness of the divergence.

Considering this argument, one would expect that Ra GANs would be weaker than Rp GANs, which would be weaker than symmetric GANs, since this is generally the order of their relative performance and stability (note, however, that this is not always true and GANs can perform better than Ra GANs). Instead, we found the opposite relationship:

Theorem 3.2.
Let P be a probability distribution with support X, (Pn)_{n∈N} be a sequence of distributions converging to P, and f : R → R be a concave function such that f(0) = 0, f is differentiable at 0, f'(0) ≠ 0, sup_x f(x) > 0, and arg sup_x f(x) > 0. Then, we have that

D^W(P, Q) is weakest,
D^W(P, Q) is weaker than D_f^Sy(P, Q),
D_f^Sy(P, Q) is weaker than D_f^Rp(P, Q),
D_f^Rp(P, Q) is weaker than D_f^Ra(P, Q),

where D^W is the Wasserstein distance and D_f^Sy is the divergence of symmetric GANs (see equation 3). The proof is in Appendix B.

Given the good performance of Ra GANs, this suggests that the argument made by Arjovsky et al. (2017) is insufficient. It only focuses on a perfect sequence of converging distributions, but the generator training does not guarantee a converging sequence of fake data distributions. It ignores the complex dynamics and intricacies of the generator training, which are still not well understood. Furthermore, it assumes an optimal critic, which is effectively unobtainable. In practice, obtaining a semi-optimal critic requires training the critic for multiple iterations before training the generator; this significantly increases the computational time. Furthermore, it has been found that WGAN does not provide a good approximation of the Wasserstein distance and that better approximations of the Wasserstein distance lead to worse GANs (Mallasto et al., 2019). This provides a further argument towards the idea that the weakness of the divergence is not a good indicator of a good divergence for GANs. As previously suggested (Jolicoeur-Martineau, 2018a), we hypothesize that what makes WGAN good for GANs is likely 1) the constraint on the critic (a Lipschitz critic) and 2) the use of a relativistic discriminator, rather than the weakness of the divergence.

4. Estimators

4.1. Rp GANs

To estimate Rp GANs, Jolicoeur-Martineau (2018a) used the following estimator²:

D̂_f^Rp(P, Q) = sup_{C : X → R} (2/k) Σ_{i=1}^{k} f(C(x_i) - C(y_i)),

where x_1, . . .
, x_k and y_1, . . . , y_k are samples from P and Q respectively. Although this is an unbiased estimator of D_f^Rp(P, Q), it is not the estimator with the minimal variance for a given mini-batch. Using the two-sample version (Lehmann, 1951) of the U-statistic theorem (Hoeffding, 1992), and given that the loss function is symmetric with respect to its arguments, one can show the following:

²Note that they actually used 1/k instead of 2/k because of how they defined the divergence.

Corollary 4.1. Let P and Q be probability distributions with support X. Let x_1, . . . , x_k and y_1, . . . , y_k be i.i.d. samples from P and Q respectively. Then, we have that

D̂_f^Rp(P, Q) = sup_{C : X → R} (2/k²) Σ_{i=1}^{k} Σ_{j=1}^{k} f(C(x_i) - C(y_j))

is the minimum-variance unbiased estimator (MVUE) of D_f^Rp(P, Q).

Although it is the MVUE, this estimator requires O(k²) operations instead of O(k). In the experiments, we will show that using this estimator does not lead to good performance. Given the quadratic scaling and the lack of performance gain, it may not be worth using.

4.2. Ra GANs and Ralf GANs

The divergences of Ra GANs and Ralf GANs assume that one knows the true expectation of the critic on real and fake data. However, in practice, we can only estimate these expectations. Although never explicitly mentioned, Jolicoeur-Martineau (2018a) simply replaced all expectations by the mini-batch mean:

μ̂_C(x) = (1/k) Σ_{i=1}^{k} C(x_i),   μ̂_C(y) = (1/k) Σ_{i=1}^{k} C(y_i),

where k is the size of the mini-batch. Given the non-linear function applied after calculating the CD, the divergence estimators of Ra GANs are biased with finite batch size k. This means that Ra GANs are only asymptotically unbiased. How large k must be for the bias to become negligible is unclear. We attempted to find a closed form for the bias with f_S, f_LS, and f_Hinge (equations 11, 12, 13 and Figure 1), but we were only able to find a closed form with f_LS. The bias with f_LS has a simple form and can be removed, as shown below:

Corollary 4.2. Let P and Q be probability distributions with support X.
Then, we have that

sup_{C : X → R} (1/k)(σ̂²_C(x) + σ̂²_C(y)) + (1/k) Σ_{i=1}^{k} f_LS(C(x_i) - μ̂_C(y)) + (1/k) Σ_{j=1}^{k} f_LS(μ̂_C(x) - C(y_j)),

sup_{C : X → R} (2/k) σ̂²_C(y) + (2/k) Σ_{i=1}^{k} f_LS(C(x_i) - μ̂_C(y)),

sup_{C : X → R} (1/2k)(σ̂²_C(x) + σ̂²_C(y)) + (1/k) Σ_{i=1}^{k} f_LS(C(x_i) - μ̂_C) + (1/k) Σ_{j=1}^{k} f_LS(μ̂_C - C(y_j))

are unbiased estimators of D_fLS^Ra(P, Q), D_fLS^Ralf(P, Q), and D_fLS^Rc(P, Q) respectively. Furthermore,

μ̂_C = (1/2k) Σ_{i=1}^{k} (C(x_i) + C(y_i)),
σ̂²_C(x) = (1/(k-1)) Σ_{i=1}^{k} (C(x_i) - μ̂_C(x))²,
σ̂²_C(y) = (1/(k-1)) Σ_{i=1}^{k} (C(y_i) - μ̂_C(y))².

See Appendix C for the proof. This means that we can estimate the loss functions of Ra LSGAN, Ralf LSGAN, and Rc LSGAN without bias. In the experiments, we will show that the bias is negligible with the usual choices of f (equations 11, 12, 13) and batch size (32 or higher).

5. Experiments

All experiments were done with the spectral GAN architecture for 32x32 images (Miyato et al., 2018) in PyTorch (Paszke et al., 2017). We used the standard hyperparameters: learning rate = .0002, batch size k = 32, and the ADAM optimizer (Kingma & Ba, 2014) with parameters (β1, β2) = (.50, .999). We trained the models for 100k iterations with one critic update per generator update. For the datasets, we used CIFAR-10 (50k training images from 10 categories) (Krizhevsky, 2009), CelebA (200k face images of celebrities) (Liu et al., 2015), and CAT (10k images of cats) (Zhang et al., 2008). All models were trained using the same seed (seed=1) on a single GPU. To evaluate the quality of generated outputs, we used the Fréchet Inception Distance (FID) (Heusel et al., 2017). For a review of the different evaluation metrics for GANs, please see Borji (2018). CAT was preprocessed by cropping all images to the faces of the cats, removing outliers (faces hidden by background), and removing images smaller than 32x32. CelebA images were center-cropped to 160x160 before being resized to 32x32. See the code for details; the code to reproduce the experiments is available at https://github.com/AlexiaJM/relativistic-f-divergences.
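The two estimator questions from Section 4 can be illustrated in a small simulation (ours, not from the paper; Gaussian scores stand in for a trained critic, and all names are hypothetical). The sketch compares the variance of the paired estimator against the O(k²) MVUE of Corollary 4.1, and approximates the finite-batch bias of one side of the Ra objective by estimating the fake-score mean from 10x more samples, in the spirit of the 320-sample approximation used in the experiments:

```python
import numpy as np

f_ls = lambda z: -(z - 1.0) ** 2 + 1.0   # least-squares f, equation (12)
rng = np.random.default_rng(0)
k, reps = 32, 8000

naive, mvue, biased, wide = [], [], [], []
for _ in range(reps):
    # Hypothetical critic scores: real scores centered at 2, fake at 0.
    cx = rng.normal(2.0, 1.0, k)
    cy_pool = rng.normal(0.0, 1.0, 10 * k)
    cy = cy_pool[:k]

    # Section 4.1: paired estimator, (2/k) sum_i f(C(x_i) - C(y_i)) ...
    naive.append(2.0 * f_ls(cx - cy).mean())
    # ... versus the MVUE of Corollary 4.1, (2/k^2) sum_ij f(C(x_i) - C(y_j)).
    mvue.append(2.0 * f_ls(cx[:, None] - cy[None, :]).mean())

    # Section 4.2: one side of the Ra objective, with the fake mean estimated
    # from the k mini-batch samples (biased) or from 10x more (nearly unbiased).
    biased.append(f_ls(cx - cy.mean()).mean())
    wide.append(f_ls(cx - cy_pool.mean()).mean())

naive, mvue = np.array(naive), np.array(mvue)
var_ratio = mvue.var() / naive.var()        # < 1: the MVUE has lower variance
rel_bias = np.mean(biased) / np.mean(wide)  # close to 1: small finite-batch bias
```

Both estimators agree in expectation, the MVUE has visibly lower variance, and the relative bias of the plug-in Ra term stays within a few percent of 1 at k = 32, consistent with the bias measurements reported below.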
5.1. Bias

We approximated the bias of Ra GANs and Rc GANs by estimating the real/fake critic mean from 320 samples rather than the 32 mini-batch samples. For f_LS, we were able to calculate the true value of the bias (in expectation; see Corollary 4.2). Results on CIFAR-10 are shown in Figure 2.

For Ra GANs, the approximation of the relative bias with f_LS was correct from 4k iterations onwards. For all choices of f, we observed the same pattern: a low approximated relative bias which stabilized after a certain number of iterations. We suspect that this may be due to the important instabilities of the first iterations, when the discriminator is not optimal. At 15k iterations, all biases had stabilized. We calculated the average of the relative bias with the different f starting at 15k iterations: .995 for the true relative bias with f_LS, .996 for the approximated relative bias with f_LS, .994 for the approximated relative bias with f_S, and .997 for the approximated relative bias with f_Hinge.

For Rc GANs, the approximation of the bias with f_LS was correct from the very beginning of training. All biases were relatively stable over time, with the exception of f_S, which increased linearly over time (up to around 1.05). We calculated the average of the relative bias with the different f: 1.007 for the true relative bias with f_LS, 1.007 for the approximated relative bias with f_LS, 1.03 for the approximated relative bias with f_S, and 1.007 for the approximated relative bias with f_Hinge.

Overall, this shows that the bias in the estimators of Ra GANs and Rc GANs tends to be small. Furthermore, with the exception of f_S, the bias is relatively stable over time. Thus, accounting for the bias may not be necessary.

5.2.
Divergences

To test the newly proposed relativistic divergences (and to verify whether removing the bias in Ra GANs is useful), we ran experiments on CIFAR-10 using f_LS, on CelebA using f_Hinge, and on CAT using f_S (these choices of f were arbitrary). Results are shown in Table 1.

Figure 2. Plots of the relative bias (i.e., the biased estimate divided by the unbiased estimate) of relativistic average and centered f-divergence estimators over training time on CIFAR-10 with a mini-batch size of 32 (left: Ra GANs; right: Rc GANs). Approximations of the bias were made using 320 independent samples.

Using the MVUE for Rp GAN resulted in the generator having a worse performance on CIFAR-10 with f_LS (β = .37, p = .72), CelebA with f_Hinge (β = 2.08, p = .07), and CAT with f_S (β = 4.02, p = .003). Similarly, using the unbiased estimator made the generator perform slightly worse for Ra LSGAN (β = 2.37, p = .04) and Rc LSGAN (β = 1.33, p = .05). These results are surprising, as they suggest that using noisy or slightly biased estimators may

Table 1. Minimum (and standard deviation) of the FID calculated at 10k, 20k, . . . , 100k iterations using different loss functions (see equations 11, 12, 13) and datasets.
Loss               CIFAR-10 (f_LS)   CelebA (f_Hinge)   CAT (f_S)
GAN                31.1 (8)          15.3 (52)          15.2 (11)
Rp GAN             31.5 (8)          16.7 (4)           12.9 (2)
Rp GAN (MVUE)      30.2 (12)         21.9 (3)           18.2 (3)
Ra GAN             29.2 (7)          15.9 (5)           12.3 (1)
Ra GAN (unbiased)  30.3 (13)         -                  -
Rc GAN             31.7 (8)          18.1 (3)           16.5 (7)
Rc GAN (unbiased)  32.3 (9)          -                  -

be beneficial.

6. Conclusion

Most importantly, we proved that the objective function of the critic in RGANs is a divergence. In addition, we showed that f-divergences are weaker than relativistic f-divergences. Thus, the weakness of the topology induced by a divergence alone cannot explain why WGAN performs well. Finally, we took a closer look at the estimators of RGANs and found that 1) the estimator of Rp GANs used by Jolicoeur-Martineau (2018a) is not the minimum-variance unbiased estimator (MVUE) and 2) the estimators of Ra GANs and Ralf GANs are slightly biased with finite batch sizes. Surprisingly, we found that neither using the MVUE with Rp GANs nor using an unbiased estimator with Ra GANs and Ralf GANs improved the performance. On the contrary, using better estimators always slightly decreased the quality of generated samples. This suggests that using noisy estimates of the divergences may be beneficial as a regularization mechanism. This could be explained by vanishing gradients when the discriminator becomes closer to optimality (Arjovsky & Bottou, 2017).

It still remains a mystery why Ra GANs are better than Rp GANs and what the direct mechanism is that leads to RGANs training in a much more stable manner. Future work should attempt to better understand the effect of the critic's difference on training. Our experiments were limited to the generation of small images; thus, we encourage further experiments with the MVUE and the unbiased estimator of Ra LSGAN.

References

Arjovsky, M. and Bottou, L. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.

Arjovsky, M., Chintala, S., and Bottou, L.
Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223, 2017.

Borji, A. Pros and cons of GAN evaluation measures. arXiv preprint arXiv:1802.03446, 2018.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672-2680. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.

Hoeffding, W. A class of statistics with asymptotically normal distribution. In Breakthroughs in Statistics, pp. 308-334. Springer, 1992.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Jolicoeur-Martineau, A. The relativistic discriminator: a key element missing from standard GAN. arXiv preprint arXiv:1807.00734, 2018a.

Jolicoeur-Martineau, A. GANs beyond divergence minimization. arXiv preprint arXiv:1809.02145, 2018b.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky, A.
Learning multiple layers of features from tiny images. 2009.

Lehmann, E. L. Consistency and unbiasedness of certain nonparametric tests. The Annals of Mathematical Statistics, pp. 165-179, 1951.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

Mallasto, A., Montúfar, G., and Gerolin, A. How well do WGANs estimate the Wasserstein metric? arXiv preprint arXiv:1910.03875, 2019.

Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Smolley, S. P. Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2813-2821. IEEE, 2017.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

Mroueh, Y. and Sercu, T. Fisher GAN. In Advances in Neural Information Processing Systems 30, pp. 2513-2523. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6845-fisher-gan.pdf.

Mroueh, Y., Li, C.-L., Sercu, T., Raj, A., and Cheng, Y. Sobolev GAN. arXiv preprint arXiv:1711.04894, 2017.

Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997.

Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29, pp. 271-279. Curran Associates, Inc., 2016.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Zhang, W., Sun, J., and Tang, X. Cat head detection - how to effectively exploit shape and texture features.
In European Conference on Computer Vision, pp. 802-816. Springer, 2008.