# fgans_in_an_information_geometric_nutshell__93e27007.pdf

f-GANs in an Information Geometric Nutshell

Richard Nock , , Zac Cranko , Aditya Krishna Menon ,

Lizhen Qu , Robert C. Williamson ,

Data61, the Australian National University and the University of Sydney {firstname.lastname, aditya.menon, bob.williamson}@data61.csiro.au

Nowozin et al showed last year how to extend the GAN principle to all fdivergences. The approach is elegant but falls short of a full description of the supervised game, and says little about the key player, the generator: for example, what does the generator actually converge to if solving the GAN game means convergence in some space of parameters? How does that provide hints on the generator s design and compare to the ﬂourishing but almost exclusively experimental literature on the subject? In this paper, we unveil a broad class of distributions for which such convergence happens namely, deformed exponential families, a wide superset of exponential families . We show that current deep architectures are able to factorize a very large number of such densities using an especially compact design, hence displaying the power of deep architectures and their concinnity in the f-GAN game. This result holds given a sufﬁcient condition on activation functions which turns out to be satisﬁed by popular choices. The key to our results is a variational generalization of an old theorem that relates the KL divergence between regular exponential families and divergences between their natural parameters. We complete this picture with additional results and experimental insights on how these results may be used to ground further improvements of GAN architectures, via (i) a principled design of the activation functions in the generator and (ii) an explicit integration of proper composite losses link function in the discriminator.

1 Introduction

In a recent paper, Nowozin et al. [30] showed that the GAN principle [15] can be extended to the variational formulation of all f-divergences. In the GAN game, there is an unknown distribution P which we want to approximate using a parameterised distribution Q. Q is learned by a generator by ﬁnding a saddle point of a function which we summarize for now as f-GAN(P, Q), where f is a convex function (see eq. (7) below for its formal expression). A part of the generator s training involves as a subroutine a supervised adversary hence, the saddle point formulation called discriminator, which tries to guess whether randomly generated observations come from P or Q. Ideally, at the end of this supervised game, we want Q to be close to P, and a good measure of this is the f-divergence If(P Q), also known as Ali-Silvey distance [1, 12]. Initially, one choice of f was considered [15]. Nowozin et al. signiﬁcantly grounded the game and expanded its scope by showing that for any f convex and suitably deﬁned, then [30, Eq. 4]:

f-GAN(P, Q) If(P Q) . (1)

The inequality is an equality if the discriminator is powerful enough. So, solving the f-GAN game can give guarantees on how P and Q are distant to each other in terms of f-divergence. This elegant characterization of the supervised game unfortunately falls short of justifying or elucidating all parameters of the supervised game [30, Section 2.4], and the paper is also silent regarding a key part of the game: the link between distributions in the variational formulation and the generator, the

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

main player which learns a parametric model of a density. In doing so, the f-GAN approach and its members remain within an information theoretic framework that relies on divergences between distributions only [30]. In the GAN world at large, this position contrasts with other prominent approaches that explicitly optimize geometric distortions between the parameters or support of distributions [6, 14, 16, 21, 22], and raises the problem of connecting the f-GAN approach to any sort of information geometric optimization. One such information-theoretic/information-geometric identity is well known: The Kullback-Leibler (KL) divergence between two distributions of the same (regular) exponential family equals a Bregman divergence D between their natural parameters [2, 4, 7, 9, 35], which we can summarize as:

If KL(P Q) = D(θ ϑ) . (2)

Here, θ and ϑ are respectively the natural parameters of P and Q. Hence, distributions are points on a manifold on the right-hand side, a powerful geometric statement [4]; however, being restricted to KL divergence or "just" exponential families, it certainly falls short of the power to explain the GAN game. To our knowledge, the only generalizations known fall short of the f-divergence formulation and are not amenable to the variational GAN formulation [5, Theorem 9], [13, Theorem 3].

Our ﬁrst contribution is such an identity that connects the general If-divergence formulation in eq. (1) to the general D (Bregman) divergence formulation in eq. (2). We now brieﬂy state it, postponing the details to Section 3:

f-GAN(P, escort(Q)) = D(θ ϑ) + Penalty(Q) , (3)

for P and Q (with respective parameters θ and ϑ) which happen to lie in a superset of exponential families called deformed exponential families, that have received extensive treatment in statistical physics and differential information geometry over the last decade [3, 25]. The right-hand side of eq. (3) is the information geometric part [4], in which D is a Bregman divergence. Therefore, the f-GAN problem can be equivalent to a geometric optimization problem [4], like for the Wasserstein GAN and its variants [6]. Notice also that Q appears in the game in the form of an escort [5]. The difference vanish only for exponential families (escort(Q) = Q, Penalty(Q) = 0 and f = KL).

Our second contribution drills down into the information-theoretic and information-geometric parts of (3). In particular, from the former standpoint, we completely specify the parameters of the supervised game, unveiling a key parameter left arbitrary in [30] (explicitly incorporating the link function of proper composite losses [32]). From the latter standpoint, we show that the standard deep generator architecture is powerful at modelling complex escorts of any deformed exponential family, factorising a number of escorts in order of the total inner layers dimensions, and this factorization happens for an especially compact design. This hints on a simple sufﬁcient condition on the activation function to guarantee the escort modelling, and it turns out that this condition is satisﬁed, exactly or in a limit sense, by most popular activation functions (ELU, Re LU, Softplus, ...). We also provide experiments1 that display the uplift that can be obtained through a principled design of the activation function (generator), or tuning of the link function (discriminator).

Due to the lack of space, a supplement (SM) provides the proof of the results in the main ﬁle and additional experiments. A longer version with a more exhaustive treatment of related results is available [27]. The rest of this paper is as follows. Section 2 presents deﬁnition, 3 formally presents eq. (3), 4 derives consequences for deep learning, 5 completes the supervised game picture of [30], Section 6 presents experiments and a last Section concludes.

2 Deﬁnitions

Throughout this paper, the domain X of observations is a measurable set. We begin with two important classes of distortion measures, f-divergences and Bregman divergences.

Deﬁnition 1 For any two distributions P and Q having respective densities P and Q absolutely continuous with respect to a base measure µ, the f-divergence between P and Q, where f : R+ R is convex with f(1) = 0, is

If(P Q) .= EX Q

X Q(x) f P(x)

dµ(x) . (4)

1The code used for our experiments is available through https://github.com/qulizhen/fgan_info_geometric

For any convex differentiable ϕ : Rd R, the (ϕ-)Bregman divergence between θ and ϱ is:

Dϕ(θ ϱ) .= ϕ(θ) ϕ(ϱ) (θ ϱ) ϕ(ϱ) , (5) where ϕ is called the generator of the Bregman divergence.

f-divergences are the key distortion measure of information theory, while Bregman divergences are the key distortion measure of information geometry. A distribution P from a (regular) exponential family with cumulant C : Θ R and sufﬁcient statistics φ : X Rd has density PC(x|θ, φ) .= exp(φ(x) θ C(θ)), where Θ is a convex open set, C is convex and ensures normalization on the simplex (we leave implicit the associated dominating measure [3]). A fundamental Theorem ties Bregman divergences and f-divergences: when P and Q belong to the same exponential family, and denoting their respective densities PC(x|θ, φ) and QC(x|ϑ, φ), it holds that IKL(P Q) = DC(ϑ θ). Here, IKL is Kullback-Leibler (KL) f-divergence (f .= x 7 x log x). Remark that the arguments in the Bregman divergence are permuted with respect to those in eq. (2) in the introduction. This also holds if we consider f KL in eq. (2) to be the Csiszár dual of f [8], namely f KL : x 7 log x, since in this case If KL(P Q) = IKL(Q P) = DC(θ ϑ). We made this choice in the introduction for the sake of readability in presenting eqs. (1 3). We now deﬁne generalizations of exponential families, following [5, 13]. Let χ : R+ R+ be non-decreasing [25, Chapter 10]. We deﬁne the χ-logarithm, logχ, as logχ(z) .= R z 1 1 χ(t)dt. The χ-exponential is expχ(z) .= 1 + R z 0 λ(t)dt, where λ is deﬁned by λ(logχ(z)) .= χ(z). In the case where the integrals are improper, we consider the corresponding limit in the argument / integrand.

Deﬁnition 2 [5] A distribution P from a χ-exponential family (or deformed exponential family, χ being implicit) with convex cumulant C : Θ R and sufﬁcient statistics φ : X Rd has density given by Pχ,C(x|θ, φ) .= expχ(φ(x) θ C(θ)), with respect to a dominating measure µ. Here, Θ is a convex open set and θ is called the coordinate of P. The escort density (or χ-escort) of Pχ,C is

Pχ,C .= 1 Z χ(Pχ,C) , Z .= Z

X χ(Pχ,C(x|θ, φ))dµ(x) . (6)

Z is the escort s normalization constant.

We leaving implicit the dominating measure and denote P the escort distribution of P whose density is given by eq. (6). We shall name χ the signature of the deformed (or χ-)exponential family, and sometimes drop indexes to save readability without ambiguity, noting e.g. P for Pχ,C. Notice that normalization in the escort is ensured by a simple integration [5, Eq. 7]. For the escort to exist, we require that Z < and therefore χ(P) is ﬁnite almost everywhere. Such a requirement would be satisﬁed in the GAN game. There is another generalization of regular exponential families, known as generalized exponential families [13, 27]. The starting point of our result is the following Theorem, in which the information-theoretic part is not amenable to the variational GAN formulation.

Theorem 3 [5][36] for any two χ-exponential distributions P and Q with respective densities Pχ,C, Qχ,C and coordinates θ, ϑ, DC(θ ϑ) = EX Q[logχ(Qχ,C(X)) logχ(Pχ,C(X))].

We now brieﬂy frame the now popular (f-)GAN adversarial learning [15, 30]. We have a true unknown distribution P over a set of objects, e.g. 3D pictures, which we want to learn. In the GAN setting, this is the objective of a generator, who learns a distribution Qθ parameterized by vector θ. Qθ works by passing (the support of) a simple, uninformed distribution, e.g. standard Gaussian, through a possibly complex function, e.g. a deep net whose parameters are θ and maps to the support of the objects of interest. Fitting Q. involves an adversary (the discriminator) as subroutine, which ﬁts classiﬁers, e.g. deep nets, parameterized by ω. The generator s objective is to come up with arg minθ Lf(θ) with Lf(θ) the discriminator s objective: Lf(θ) .= sup ω {EX P[Tω(X)] EX Qθ[f (Tω(X))]} , (7)

where is Legendre conjugate [10] and Tω : X R integrates the classiﬁer of the discriminator and is therefore parameterized by ω. Lf is a variational approximation to a f-divergence [30]; the discriminator s objective is to segregate true (P) from fake (Q.) data. The original GAN choice, [15] f GAN(z) .= z log z (z + 1) log(z + 1) + 2 log 2 (8) (the constant ensures f(1) = 0) can be replaced by any convex f meeting mild assumptions.

3 A variational information geometric identity for the f-GAN game

We deliver a series of results that will bring us to formalize eq. (3). First, we deﬁne a new set of distortion measures, that we call KLχ divergences.

Deﬁnition 4 For any χ-logarithm and distributions P, Q having respective densities P and Q absolutely continuous with respect to base measure µ, the KLχ divergence between P and Q is deﬁned as KLχ(P Q) .= EX P logχ (Q(X)/P(X)) .

Since χ is non-decreasing, logχ is convex and so any KLχ divergence is an f-divergence. When χ(z) .= z, KLχ is the KL divergence. In what follows, base measure µ and absolute continuity are implicit, as well as that P (resp. Q) is the density of P (resp. Q). We now relate KLχ divergences vs f-divergences. Let f be the subdifferential of convex f and IP,Q .= [infx P(x)/Q(x), supx P(x)/Q(x)) R+ denote the range of density ratios of P over Q. Our ﬁrst result states that if there is a selection of the subdifferential which is upperbounded on IP,Q, the f-divergence If(P Q) is equal to a KLχ divergence.

Theorem 5 Suppose that P, Q are such that there exists a selection ξ f with sup ξ(IP,Q) < . Then χ : R+ R+ non decreasing such that If(P Q) = KLχ(Q P).

Theorem 5 essentially covers most if not all relevant GAN cases, as the assumption has to be satisﬁed in the GAN game for its solution not to be vacuous up to a large extent (eq. (7)). We provide a more complete treatment in the extended version [27]. The proof of Theorem 5 (in SM, Section I) is constructive: it shows how to pick χ which satisﬁes all requirements. It brings the following interesting corollary: under mild assumptions on f, there exists a χ that ﬁts for all densities P and Q. A prominent example of f that ﬁts is the original GAN choice for which we can pick

χGAN(z) .= 1 log 1 + 1

We now deﬁne a slight generalization of KLχ-divergences and allow for χ to depend on the choice of the expectation s X, granted that for any of these choices, it will meet the constraints to be R+ R+ and also increasing, and therefore deﬁne a valid signature. For any f : X R+, we denote KLχf (P Q) .= EX P h logχf(X) (Q(X)/P(X)) i , where for any p R+, χp(t) .= 1

p χ(tp). Whenever f = 1, we just write KLχ as we already did in Deﬁnition 4. We note that for any x X, χf(x) is increasing and non negative because of the properties of χ and f, so χf(x)(t) deﬁnes a χ-logarithm. We are ready to state a Theorem that connects KLχ-divergences and Theorem 3.

Theorem 6 Letting P .= Pχ,C and Q .= Qχ,C for short in Theorem 3, we have EX Q[logχ(Q(X)) logχ(P(X))] = KLχ Q( Q P) J(Q), with J(Q) .= KLχ Q( Q Q).

(Proof in SM, Section II) To summarize, we know that under mild assumptions relatively to the GAN game, f-divergences coincide with KLχ divergences (Theorems 5). We also know from Theorem 6 that KLχ. divergences quantify the geometric proximity between the coordinates of generalized exponential families (Theorem 3). Hence, ﬁnding a geometric (parameter-based) interpretation of the variational f-GAN game as described in eq. (7) can be done via a variational formulation of the KLχ divergences appearing in Theorem 6. Since penalty J(Q) does not belong to the GAN game (it does not depend on P), it reduces our focus on KLχ Q( Q P).

Theorem 7 KLχ Q( Q P) admits the variational formulation

KLχ Q( Q P) = sup

n EX P[T(X)] EX Q[( logχ Q) (T(X))] o , (10)

with R++ .= R\R++. Furthermore, letting Z denoting the normalization constant of the χ-escort of Q, the optimum T : X R++ to eq. (10) is T (x) = (1/Z) (χ(Q(x))/χ(P(x))).

(Proof in SM, Section III) Hence, the variational f-GAN formulation can be captured in an information-geometric framework by the following identity using Theorems 3, 5, 7.

Corollary 8 (the variational information-geometric f-GAN identity) Using notations from Theorems 6, 7 and letting θ (resp. ϑ) denote the coordinate of P (resp. Q), we have:

n EX P[T(X)] EX Q[( logχ Q) (T(X))] o = DC(θ ϑ) + J(Q) . (11)

We shall also name for short vig-f-GAN the identity in eq. (11). We note that we can drill down further the identity, expressing in particular the Legendre conjugate ( logχ Q) with an equivalent "dual" (negative) χ-logarithm in the variational problem [27]. The left hand-side of Eq. (11) has the exact same overall shape as the variational objective of [30, Eqs 2, 6]. However, it tells the formal story of GANs in signiﬁcantly greater details, in particular for what concerns the generator. For example, eq. (11) yields a new characterization of the generators convergence: because DC is a Bregman divergence, it satisﬁes the identity of the indiscernibles. So, solving the f-GAN game [30] can guarantees convergence in the parameter space (ϑ vs θ). In the realm of GAN applications, it makes sense to consider that P (the true distribution) can be extremely complex. Therefore, even when deformed exponential families are signiﬁcantly more expressive than regular exponential families [25], extra care should be put before arguing that complex applications comply with such a geometric convergence in the parameter space. One way to circumvent this problem is to build distributions in Q that factorize many deformed exponential families. This is one strong point of deep architectures that we shall prove next.

4 Deep architectures in the vig-f-GAN game

In the GAN game, distribution Q in eq. (11) is built by the generator (call it Qg), by passing the support of a simple distribution (e.g. uniform, standard Gaussian), Qin, through a series of non-linear transformations. Letting Qin denote the corresponding density, we now compute Qg. Our generator g : X Rd consists of two parts: a deep part and a last layer. The deep part is, given some L N, the computation of a non-linear transformation φL : X Rd L as

Rdl φl(x) .= v(Wlφl 1(x) + bl) , l {1, 2, ..., L} , (12) φ0(x) .= x X . (13)

v is a function computed coordinate-wise, such as (leaky) Re LUs, ELUs [11, 17, 23, 24], Wl Rdl dl 1, bl Rdl. The last layer computes the generator s output from φL: g(x) .= v OUT(ΓφL(x)+ β), with Γ Rd d L, β Rd; in general, v OUT = v and v OUT ﬁts the output to the domain at hand, ranging from linear [6, 20] to non-linear functions like tanh [30]. Our generator captures the high-level features of some state of the art generative approaches [31, 37].

To carry our analysis, we make the assumption that the network is reversible, which is going to reguire that v OUT, Γ, Wl (l {1, 2, ..., L}) are invertible. We note that popular examples can be invertible (e.g. DCGAN, if we use µ-Re LU, dimensions match and fractional-strided convolutions are invertible). At this reasonable price, we get in closed form the generator s density and it shows the following: for any continuous signature χnet, there exists an activation function v such that the deep part in the network factors as escorts for the χnet-exponential family. Let 1i denote the ith canonical basis vector.

Theorem 9 v OUT, Γ, Wl invertible (l {1, 2, ..., L}), for any continuous signature χnet, there exists activation v and bl Rd ( l {1, 2, ..., L}) such that for any output z, letting x .= g 1(z), Qg(z) factorizes as Qg(z) = (Qin(x)/ Qdeep(x)) 1/(Hout(x) Znet), with Znet > 0 a constant, Hout(x) .= Qd i=1 |v OUT(γ i φL(x) + βi)|, γi .= Γ 1i, and (letting wl,i .= W l 1i):

Qdeep(x) .=

i=1 Pχnet,bl,i(x|wl,i, φl 1) . (14)

Name v(z) χ(z) Re LU( ) max{0, z} 1z>0

Leaky-Re LU( ) z if z > 0 ϵz if z 0

1 if z > δ 1 ϵ if z δ

(α, β)-ELU( ) βz if z > 0 α(exp(z) 1) if z 0

β if z > α z if z α

prop-τ ( ) k + τ (z)

τ 1 (τ ) 1(τ (0)z)

τ (0) Softplus( ) k + log2(1 + exp(z)) 1 log 2 1 2 z

µ-Re LU( ) k + z+

0 if z < 1 (1 + z)2 if z [ 1, 1] 4z if z > 1

2 z if z < 4 4 if z > 4

Table 1: Some strongly/weakly admissible couples (v, χ). ( ) : 1. is the indicator function; ( ) : δ 0, 0 < ϵ 1 and dom(v) = [δ/ϵ, + ). ( ) : β α > 0; ( ) : is Legendre conjugate; ( ) : µ [0, 1). Shaded: prop-τ activations; k is a constant (e.g. such that v(0) = 0) (see text).

(Proof in SM, Section IV) The relationship between the inner layers of a deep net and deformed exponential families (Deﬁnition 2) follows from the theorem: rows in Wls deﬁne coordinates, φl deﬁne "deep" sufﬁcient statistics, bl are cumulants and the crucial part, the χ-family, is given by the activation function v. Notice also that the bls are learned, and so the deformed exponential families normalization is in fact learned and not speciﬁed. We see that Qdeep factors escorts, and in number. What is notable is the compactness achieved by the deep representation: the total dimension of all deep sufﬁcient statistics in Qdeep (eq. (14)) is L d. To handle this, a shallow net with a single inner layer would require a matrix W of space Ω(L2 d2). The deep net g requires only O(L d2) space to store all Wls. The proof of Theorem 9 is constructive: it builds v as a function of χ. In fact, the proof also shows how to build χ from the activation function v in such a way that Qdeep factors χ-escorts. The following Lemma essentially says that this is possible for all strongly admissible activations v.

Deﬁnition 10 Activation function v is strongly admissible iff dom(v) R+ = and v is C1, lowerbounded, strictly increasing and convex. v is weakly admissible iff for any ϵ > 0, there exists vϵ strongly admissible such that ||v vϵ||L1 < ϵ, where ||f||L1 .= R |f(t)|dt.

Lemma 11 The following holds: (i) for any strongly admissible v, there exists signature χ such that Theorem 9 holds; (ii) (γ,γ)-ELU (for any γ > 0), Softplus are strongly admissible. Re LU is weakly admissible.

(proof in SM, Section V) The proof uses a trick for Re LU which can easily be repeated for (α, β)- ELU, and for leaky-Re LU, with the constraint that the domain has to be lowerbounded. Table 1 provides some examples of strongly / weakly admissible activations. It includes a wide class of so-called "prop-τ activations", where τ is negative a concave entropy, deﬁned on [0, 1] and symmetric around 1/2 [29]. This concludes our treatment of the information geometric part of the vig-f-GAN identity. We now complete it with a treatment of its information-theoretic part.

5 A complete proper loss picture of the supervised GAN game

In their generalization of the GAN objective, Nowozin et al. [30] leave untold a key part of the supervised game: they split in eq. (7) the discriminator s contribution in two, Tω = gf Vω, where Vω : X R is the actual discriminator, and gf is essentially a technical constraint to ensure that Vω(.) is in the domain of f . They leave the choice of gf "somewhat arbitrary" [30, Section 2.4]. We now show that if one wants the supervised loss to have the desirable property to be proper composite [32]2, then gf is not arbitrary. We proceed in three steps, ﬁrst unveiling a broad class of proper f-GANs that deal with this property. The initial motivation of eq. (7) was that the inner maximisation may be seen as the f-divergence between P and Qθ [26], Lf(θ) = If(P Qθ). In fact, this variational

2informally, Bayes rule realizes the optimum and the loss accommodates for any real valued predictor.

representation of an f-divergence holds more generally: by [33, Theorem 9], we know that for any convex f, and invertible link function Ψ: (0, 1) R, we have:

inf T : X R E (X,Y) D [ℓΨ(Y, T(X))] = 1

2 If(P Q) (15)

where D is the distribution over (observations {fake, real}) and the loss function ℓΨ is deﬁned by:

ℓΨ(+1, z) .= f Ψ 1(z) 1 Ψ 1(z)

; ℓΨ( 1, z) .= f f Ψ 1(z) 1 Ψ 1(z)

assuming f differentiable. Note now that picking Ψ(z) = f (z/(1 z)) with z .= T(x) and simplifying eq. (15) with P[Y = fake] = P[Y = real] = 1/2 in the GAN game yields eq. (7). For other link functions, however, we get an equally valid class of losses whose optimisation will yield a meaningful estimate of the f-divergence. The losses of eq. (16) belong to the class of proper composite losses with link function Ψ [32]. Thus (omitting parameters θ, ω), we rephrase eq. (7) and refer to the proper f-GAN formulation as inf Q LΨ(Q) with (ℓis as per eq. (16)):

LΨ(Q) .= sup T : X R

E X P [ ℓΨ(+1, T(X))] + E X Q [ ℓΨ( 1, T(X))] . (17)

Note also that it is trivial to start from a suitable proper composite loss, and derive the corresponding generator f for the f-divergence as per eq. (15). Finally, our proper composite loss view of the f-GAN game allows us to elicitate gf in [30]: it is the composition of f and Ψ in eq. (16). The use of proper composite losses as part of the supervised GAN formulation sheds further light on another aspect the game: the connection between the value of the optimal discriminator, and the density ratio between the generator and discriminator distributions. Instead of the optimal T (x) = f (P(x)/Q(x)) for eq. (7) [30, Eq. 5], we now have with the more general eq. (17) the result T (x) = Ψ((1 + Q(x)/P(x)) 1). We now show that proper f-GANs can easily be adapted to eq. (11). We let χ (t) .= 1/χ 1(1/t).

Theorem 12 For any χ, deﬁne ℓx( 1, z) .= log(χ ) 1 Q(x) ( z), and let ℓ(+1, z) .= z. Then

LΨ(Q) in eq. (17) equals eq. (11). Its link in eq. (17) is Ψx(z) = 1/χ Q(x) (z/(1 z)).

(Proof in SM, Section VI) Hence, in the proper composite view of the vig-f-GAN identity, the generator rules over the supervised game: it tempers with both the link function and the loss but only for fake examples. Notice also that when z = 1, the fake examples loss satisﬁes ℓx( 1, 1) = 0 regardless of x by deﬁnition of the χ-logarithm.

6 Experiments

Two of our theoretical contributions are (A) the fact that on the generator s side, there exists numerous activation functions v that comply with the design of its density as factoring escorts (Lemma 11), and (B) the fact that on the discriminator s side, the so-called output activation function gf of [30] aggregates in fact two components of proper composite losses, one of which, the link function Ψ, should be a ﬁne knob to operate (Theorem 12). We have tested these two possibilities with the idea that an experimental validation should provide substantial ground to be competitive with mainstream approaches, leaving space for a ﬁner tuning in speciﬁc applications. Also, in order not to mix their effects, we have treated (A) and (B) separately.

Architectures and datasets We provide in SM (Section VI) the detail of all experiments. To summarize, we consider two architectures in our experiments: DCGAN [31] and the multilayer feedforward network (MLP) used in [30]. Our datasets are MNIST [19] and LSUN tower category [38].

Comparison of varying activations in the generator (A) We have compared µ-Re LUs with varying µ in [0, 0.1, ..., 1] (hence, we include Re LU as a baseline for µ = 1), the Softplus and the Least Square Unit (LSU, Table 1) activation (Figure 1). For each choice of the activation function, all inner layers of the generator use the same activation function. We evaluate the activation functions by using both DCGAN and the MLP used in [30] as the architectures. As training divergence, we adopt both GAN [15] and Wasserstein GAN (WGAN, [6]). Results are shown in Figure 1 (left).

µ-Re LU Softplus / LS / Re LU Discriminator: varying link

Figure 1: Summary of our results on MNIST, on experiment A (left+center) and B (right). Left: comparison of different values of µ for the µ-Re LU activation in the generator (Re LU = 1-Re LU, see text). Thicker horizontal dashed lines present the Re LU average baseline: for each color, points above the baselines represent values of µ for which Re LU is beaten on average. Center: comparison of different activations in the generator, for the same architectures as in the left plot. Right: comparison of different link function in the discriminator (see text, best viewed in color).

Three behaviours emerge when varying µ: either it is globally equivalent to Re LU (GAN DCGAN) but with local variations that can be better (µ = 0.7) or worse (µ = 0), or it is almost consistently better than Re LU (WGAN MLP) or worse (GAN MLP). The best results were obtained for GAN DCGAN, and we note that the Re LU baseline was essentially beaten for values of µ yielding smaller variance, and hence yielding smaller uncertainty in the results. The comparison between different activation functions (Figure 1, center) reveals that (µ-)Re LU performs overall the best, yet with some variations among architectures. We note in particular that, in the same way as for the comparisons intra µ-Re LU (Figure 1, left), Re LU performs relatively worse than the other criteria for WGAN MLP, indicating that there may be different best ﬁt activations for different architectures, which is good news. Visual results on LSUN (SM, Table A5) also display the quality of results when changing the µ-Re LU activation.

Comparison of varying link functions in the discriminator (B) We have compared the replacement of the sigmoid function by a link which corresponds to the entropy which is theoretically optimal in boosting algorithms, Matsushita entropy [18, 28], for which ΨMAT(z) .= (1/2) (1 + z/

1 + z2). Figure 1 (right) displays the comparison Matsushita vs "standard" (more speciﬁcally, we use sigmoid in the case of GAN [30], and none in the case of WGAN to follow current implementations [6]). We evaluate with both DCGAN and MLP on MNIST (same hyperparameters as for generators, Re LU activation for all hidden layer activation of generators). Experiments tend to display that tuning the link may indeed bring additional uplift: for GANs, Matsushita is indeed better than the sigmoid link for both DCGAN and MLP, while it remains very competitive with the no-link (or equivalently an identity link) of WGAN, at least for DCGAN.

7 Conclusion

It is hard to exaggerate the success of GAN approaches in modelling complex domains, and with their success comes an increasing need for a rigorous theoretical understanding [34]. In this paper, we complete the supervised understanding of the generalization of GANs introduced in [30], and provide a theoretical background to understand its unsupervised part, showing in particular how deep architectures can be powerful at tackling the generative part of the game. Experiments display that the tools we develop may help to improve further the state of the art.

8 Acknowledgments

The authors thank the reviewers, Shun-ichi Amari, Giorgio Patrini and Frank Nielsen for numerous comments.

[1] S.-M. Ali and S.-D.-S. Silvey. A general class of coefﬁcients of divergence of one distribution from another. Journal of the Royal Statistical Society B, 28:131 142, 1966.

[2] S.-I. Amari. Differential-Geometrical Methods in Statistics. Springer-Verlag, Berlin, 1985. [3] S.-I. Amari. Information Geometry and Its Applications. Springer-Verlag, Berlin, 2016. [4] S.-I. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000. [5] S.-I. Amari, A. Ohara, and H. Matsuzoe. Geometry of deformed exponential families: Invariant, dually-ﬂat and conformal geometries. Physica A: Statistical Mechanics and its Applications, 391:4308 4319, 2012. [6] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. Co RR, abs/1701.07875, 2017. [7] K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. MLJ, 43(3):211 246, 2001. [8] A. Ben-Tal, A. Ben-Israel, and M. Teboulle. Certainty equivalents and information measures: Duality and extremal principles. J. of Math. Anal. Appl., pages 211 236, 1991. [9] J.-D. Boissonnat, F. Nielsen, and R. Nock. Bregman voronoi diagrams. DCG, 44(2):281 307, 2010. [10] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004. [11] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In 4th ICLR, 2016. [12] I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:299 318, 1967. [13] R.-M. Frongillo and M.-D. Reid. Convex foundations for generalized maxent models. In 33rd Max Ent, pages 11 16, 2014. [14] A. Genevay, G. Peyré, and M. Cuturi. Sinkhorn-autodiff: Tractable Wasserstein learning of generative models. Co RR, abs/1706.00292, 2017. [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS*27, pages 2672 2680, 2014. [16] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A.-C. Courville. Improved training of wasserstein GANs. Co RR, abs/1704.00028, 2017. [17] R.-H.-R. Hahnloser, R. Sarpeshkar, M.-A. Mahowald, R.-J. Douglas, and H.-S. Seung. Digital selection and analogue ampliﬁcation coexist in a cortex-inspired silicon circuit. Nature, 405:947 951, 2000. [18] M.J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. J. Comp. Syst. Sc., 58:109 128, 1999. [19] Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. [20] H. Lee, R. Ge, T. Ma, A. Risteski, and S. Arora. On the ability of neural nets to express distributions. Co RR, abs/1702.07028, 2017. [21] Y. Li, K. Swersky, and R.-S. Zemel. Generative moment matching networks. In 32nd ICML, pages 1718 1727, 2015. [22] S. Liu, O. Bousquet, and K. Chaudhuri. Approximation and convergence properties of generative adversarial learning. Co RR, abs/1705.08991, 2017. [23] A.-L. Maas, A.-Y. Hannun, and A.-Y. Ng. Rectiﬁer nonlinearities improve neural network acoustic models. In 30th ICML, 2013. [24] V. Nair and G. Hinton. Rectiﬁed linear units improve restricted Boltzmann machines. In 27th ICML, pages 807 814, 2010. [25] J. Naudts. Generalized thermostatistics. Springer, 2011. [26] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847 5861, Nov 2010. [27] R. Nock, Z. Cranko, A.-K. Menon, L. Qu, and R.-C. Williamson. f-GANs in an information geometric nutshell. Co RR, abs/1707.04385, 2017. [28] R. Nock and F. Nielsen. On the efﬁcient minimization of classiﬁcation-calibrated surrogates. In NIPS*21, pages 1201 1208, 2008. [29] R. Nock and F. Nielsen. Bregman divergences and surrogates for learning. IEEE Trans.PAMI, 31:2048 2059, 2009. [30] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: training generative neural samplers using variational divergence minimization. In NIPS*29, pages 271 279, 2016. [31] A. Radford, L. Metz, and S. Chintala. unsupervised representation learning with deep convolutional generative adversarial networks. In 4th ICLR, 2016. [32] M.-D. Reid and R.-C. Williamson. Composite binary losses. JMLR, 11, 2010. [33] M.-D. Reid and R.-C. Williamson. Information, divergence and risk for binary experiments. JMLR, 12:731 817, 2011. [34] T. Salimans, I.-J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS*29, pages 2226 2234, 2016. [35] M. Telgarsky and S. Dasgupta. Agglomerative Bregman clustering. In 29 th ICML, 2012. [36] R.-F. Vigelis and C.-C. Cavalcante. On ϕ-families of probability distributions. J. Theor. Probab., 21:1 15, 2011. [37] L. Wolf, Y. Taigman, and A. Polyak. Unsupervised creation of parameterized avatars. Co RR, abs/1704.05693, 2017. [38] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. ar Xiv preprint ar Xiv:1506.03365, 2015.