# Moreau-Yosida f-divergences

Dávid Terjék 1

Variational representations of f-divergences are central to many machine learning algorithms, with Lipschitz constrained variants recently gaining attention. Inspired by this, we define the Moreau-Yosida approximation of f-divergences with respect to the Wasserstein-1 metric. The corresponding variational formulas provide a generalization of a number of recent results, novel special cases of interest and a relaxation of the hard Lipschitz constraint. Additionally, we prove that the so-called tight variational representation of f-divergences can be taken over the quotient space of Lipschitz functions, and give a characterization of functions achieving the supremum in the variational representation. On the practical side, we propose an algorithm to calculate the tight convex conjugate of f-divergences compatible with automatic differentiation frameworks. As an application of our results, we propose the Moreau-Yosida f-GAN, providing an implementation of the variational formulas for the Kullback-Leibler, reverse Kullback-Leibler, χ², reverse χ², squared Hellinger, Jensen-Shannon, Jeffreys, triangular discrimination and total variation divergences as GANs trained on CIFAR-10, leading to competitive results and a simple solution to the problem of uniqueness of the optimal critic.

1Alfréd Rényi Institute of Mathematics, Budapest, Hungary. Correspondence to: Dávid Terjék.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

Variational representations of divergences between probability measures are central to many machine learning algorithms, such as generative adversarial networks (Nowozin et al., 2016), mutual information estimation (Belghazi et al., 2018) and maximization (Hjelm et al., 2019), and energy-based models (Arbel et al., 2021). One class of such measures is the family of f-divergences (Csiszár, 1963; Ali & Silvey, 1966; Csiszár, 1967), generalizing the well-known Kullback-Leibler divergence from information theory. Another is the family of optimal transport distances (Villani, 2008), including the Wasserstein-1 metric. In general, variational representations are supremums of integral formulas taken over sets of functions, such as the Donsker-Varadhan formula (Donsker & Varadhan, 1976) for the Kullback-Leibler divergence or the Kantorovich-Rubinstein formula (Villani, 2008) for the Wasserstein-1 metric. Informally speaking, one can implement (Nowozin et al., 2016; Arjovsky et al., 2017) such a formula by constructing a real-valued neural network called the critic (or discriminator) taking samples from the two probability measures as inputs, which is then trained to maximize the integral formula in order to approximate the supremum, resulting in a learned proxy to the actual divergence of said probability measures. Implementing the Kantorovich-Rubinstein formula in such a way involves restricting the Lipschitz constant of the neural network (Gulrajani et al., 2017; Petzka et al., 2018; Miyato et al., 2018; Adler & Lunz, 2018; Terjék, 2020), which effectively stabilizes the approximation procedure.
Recently, Lipschitz regularization has been incorporated (Farnia & Tse, 2018; Zhou et al., 2019; Ozair et al., 2019; Song & Ermon, 2020; Arbel et al., 2021; Birrell et al., 2020) into learning algorithms based on variational formulas of divergences other than the Wasserstein-1 metric, leading to the same empirical effect and a number of theoretical benefits. Inspired by this, we study Lipschitz-constrained variational representations of f-divergences. We show that existing instances of such variants are special cases of the Moreau-Yosida approximation of f-divergences with respect to the Wasserstein-1 metric. To any divergence and pair of probability measures corresponds a set of optimal critics, which are exactly those functions which achieve the supremum in the variational representation. An optimal critic corresponding to f-divergences is not Lipschitz in general (not even continuous). Since any function represented by a neural network is Lipschitz, when a neural network is trained to approximate such a divergence, its target, an optimal critic, will never be reached. We show that when the divergence is replaced by its Moreau-Yosida approximation, the corresponding optimal critics are all Lipschitz continuous with uniformly bounded Lipschitz constants, leading to a divergence which is easier to approximate in practice via neural networks. The approximation is parametrized by a pair of real numbers, one of which controls the sharpness of the approximation and the Lipschitz constant of optimal critics. The other controls the behavior of the approximation such that a special case induces a hard Lipschitz constraint in the variational representation, and other values induce only a Lipschitz penalty term. While instances of the former already appeared in the literature, the latter is novel to our paper. A special case reduces to a novel, unconstrained variational representation of the Wasserstein-1 metric. In order to prove these results, we first generalize the so-called tight variational representation of f-divergences to be taken over the space of Lipschitz functions or its quotient space, which is the subspace of functions vanishing at an arbitrary, fixed point. The latter leads to optimal critics being unique, having practical benefits. We additionally characterize the functions achieving the supremum in the variational representation. To apply the results, we propose an algorithm compatible with automatic differentiation frameworks to calculate the tight convex conjugate of f-divergences, which in most cases does not admit a closed form, using Newton's method in the forward pass and implicit differentiation in the backward pass. Finally, to demonstrate the usefulness of our results, we propose the Moreau-Yosida f-GAN, and implement it for the task of generative modeling on CIFAR-10. The experiments show that it is beneficial to use the Moreau-Yosida approximation as a proxy for f-divergences, the novel cases of which often outperform the ones with the hard Lipschitz constraint. Additionally, the representation over the quotient space leads to a simple solution for the problem of uniqueness of the optimal critic.
To summarize, our contributions are (i) a generalization of the tight variational representation of f-divergences between probability measures on compact metric spaces along with a characterization of functions achieving the supremum, (ii) a practical algorithm to calculate the tight convex conjugate of f-divergences compatible with automatic differentiation frameworks, (iii) variational formulas for the Moreau-Yosida approximation of f-divergences with respect to the Wasserstein-1 metric, including a relaxation of the hard Lipschitz constraint and an unconstrained variational representation of the Wasserstein-1 metric, and (iv) the Moreau-Yosida f-GAN implementing the variational formulas for the Kullback-Leibler, reverse Kullback-Leibler, χ², reverse χ², squared Hellinger, Jensen-Shannon, Jeffreys, triangular discrimination and total variation divergences as GANs trained on CIFAR-10, leading to competitive performance.

2. Preliminaries

2.1. Notations

Denote the extended reals R̄ = R ∪ {−∞, ∞}, the nonnegative reals R+, and the extended nonnegative reals R̄+ = R+ ∪ {∞}. The indicator of a set A is denoted i_A, with i_A(x) = 0 if x ∈ A and i_A(x) = ∞ otherwise. Absolute continuity and singularity of measures are denoted ≪ and ⊥, the Radon-Nikodym derivative of a measure µ with respect to a nonnegative measure ν such that µ ≪ ν by dµ/dν, the support of a measure µ by supp(µ), and a property holding almost everywhere with respect to a measure µ by µ-a.e. The relative interior of a subset A of a vector space is denoted relint A, which for subsets of R differs from the interior only for singletons, whose relative interior is the singleton itself.

2.2. Convex analysis (Zalinescu, 2002)

Given a topological vector space X, denote its topological dual by X*, i.e. the set of real-valued continuous linear maps on X, which is a topological vector space itself, and the canonical pairing by ⟨·, ·⟩ : X × X* → R, which is the continuous bilinear map ((x, x*) ↦ ⟨x, x*⟩ = x*(x)). Given a function f : X → R̄, the set dom f = {x ∈ X : f(x) < ∞} is the effective domain of f. A function f is proper if dom f ≠ ∅ and f(x) > −∞ for all x ∈ X, otherwise it is improper. For a convex function f : X → R̄, its convex conjugate is f* : X* → R̄ defined by f*(x*) = sup_{x∈X} {⟨x, x*⟩ − f(x)}, and its subdifferential at x ∈ X is the set ∂f(x) = {x* ∈ X* | ∀x̂ ∈ X : ⟨x̂ − x, x*⟩ ≤ f(x̂) − f(x)}. The biconjugate f** of f is the conjugate of its conjugate f*, i.e. f**(x) = sup_{x*∈X*} {⟨x, x*⟩ − f*(x*)}, which is equal to f if f is proper, convex and lower semicontinuous. In that case, the supremum in the biconjugate representation is achieved precisely at elements of ∂f(x). Conversely, the supremum in the conjugate representation of f*(x*) is achieved at elements of ∂f*(x*) = {x ∈ X | ∀x̂* ∈ X* : ⟨x, x̂* − x*⟩ ≤ f*(x̂*) − f*(x*)}.

2.3. f-divergences

Given a proper, convex and lower semicontinuous function¹ φ : R → R̄, a measure µ and a nonnegative measure ν on a measurable space X, the f-divergence Dφ(µ‖ν) of µ from ν is defined (Csiszár, 1963; Ali & Silvey, 1966; Csiszár, 1967; Borwein & Lewis, 1993; Csiszár et al., 1999) as

$$D_\varphi(\mu \| \nu) = \int \varphi\Big(\frac{d\mu_c}{d\nu}\Big)\, d\nu + \varphi'(\infty)\, \mu_s^+(X) - \varphi'(-\infty)\, \mu_s^-(X). \quad (1)$$

Here, µc ≪ ν and µs ⊥ ν are the absolutely continuous and singular parts of the Lebesgue decomposition of µ with respect to ν, µ+s, µ−s ≥ 0 is the Jordan decomposition of the singular part, and φ'(±∞) = lim_{x→±∞} φ(x)/x ∈ R̄. The well-known variational representation

$$D_\varphi(\mu \| \nu) = \sup_{f : X \to \mathbb{R}} \left\{ \int f\, d\mu - \int \varphi^* \circ f\, d\nu \right\} \quad (2)$$

can be obtained as the biconjugate of the mapping (µ ↦ Dφ(µ‖ν)).

¹Originally, f is used in place of φ (hence the name), but we reserve the symbol f for other functions.
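To make (1) and (2) concrete, here is a minimal NumPy sketch (an illustration, not part of the paper) for discrete distributions using a standard generator for the Kullback-Leibler divergence, φ(x) = x log x − x + 1, whose conjugate is φ*(y) = e^y − 1; the supremum in (2) is then attained at f = log(dµ/dν). The distributions below are arbitrary examples.

```python
import numpy as np

# Kullback-Leibler case: phi(x) = x*log(x) - x + 1, with conjugate phi*(y) = exp(y) - 1.
# For discrete mu << nu, plugging f = log(dmu/dnu) into (2) recovers the definition (1).
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.4, 0.4, 0.2])

ratio = mu / nu                                                   # Radon-Nikodym derivative dmu/dnu
kl_definition = np.sum(nu * (ratio * np.log(ratio) - ratio + 1))  # formula (1), no singular part
f = np.log(ratio)                                                 # a maximizer of (2)
kl_variational = np.sum(mu * f) - np.sum(nu * (np.exp(f) - 1.0))  # formula (2)

assert np.isclose(kl_definition, kl_variational)
```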
The so-called tight variational representation

$$D_\varphi(\mu \| \nu) = \sup_{f : X \to \mathbb{R}} \left\{ \int f\, d\mu - \inf_{\sup f(X) - \varphi'(\infty) \le \gamma} \left\{ \int \varphi_+^* \circ (f - \gamma)\, d\nu + \gamma \right\} \right\} \quad (3)$$

with φ+ = φ + i_{R+} was obtained in Agrawal & Horel (2020) as the biconjugate of the mapping (µ ↦ Dφ(µ‖ν) + i_{P(X)}(µ)) (already considered in Ruderman et al. (2012)), and is valid for pairs of probability measures µ, ν.

2.4. Wasserstein-1 distance (Villani, 2008)

Given probability measures µ, ν on a metric space (X, d), the Wasserstein-1 distance of µ and ν is defined as

$$W_1(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int d(x_1, x_2)\, d\pi(x_1, x_2), \quad (4)$$

where Π(µ, ν) is the set of probability measures supported on the product space X × X with marginals µ and ν. It has a well-known variational representation called the Kantorovich-Rubinstein formula

$$W_1(\mu, \nu) = \sup_{\|f\|_L \le 1} \left\{ \int f\, d\mu - \int f\, d\nu \right\}, \quad (5)$$

where

$$\|f\|_L = \sup_{x, y \in X,\, x \ne y} \frac{|f(x) - f(y)|}{d(x, y)} \quad (6)$$

is the Lipschitz norm of f. The supremum is achieved by the so-called Kantorovich potentials f* : X → R, unique µ, ν-a.e. up to an additive constant.

2.5. Moreau-Yosida approximation

Let (X, d) be a metric space and f : X → R̄ a proper function, and 0 < λ, α ∈ R constants. The Moreau-Yosida approximation of index λ and order α of f is defined (Jost & Li-Jost, 2008; Dal Maso, 1993) as

$$f_{\lambda, \alpha}(x) = \inf_{y \in X} \{ f(y) + \lambda\, d(x, y)^\alpha \}. \quad (7)$$

It holds that f̄(x) = sup_{λ>0} f_{λ,α}(x) = lim_{λ→∞} f_{λ,α}(x), where f̄ is the greatest lower semicontinuous function with f̄ ≤ f.

3. Lipschitz representation of f-divergences

In this work, we consider the set P(X) of probability measures on a compact metric space (X, d), which is itself a compact metric space with the metric W1, metrizing the weak convergence of measures. We prove that the tight variational representation of Dφ from Agrawal & Horel (2020) can be generalized in the sense that the supremum can be taken over the set Lip(X, x0) of Lipschitz continuous functions on X that vanish at an arbitrary base point x0 ∈ X. This is a strictly smaller set than the set of bounded and measurable functions over which the supremum was taken originally. To apply convex analytic techniques, we consider the duality between vector spaces of measures and Lipschitz functions. This aspect is detailed in Appendix 8.1. An important property of the choice of vector spaces is that the topology on the space of measures generalizes the usual weak convergence of probability measures (Hanin, 1999). Proofs and more precise statements of our propositions can be found in Appendix 8.2.

Proposition 1. Given probability measures µ, ν ∈ P(X) and a proper, convex and lower semicontinuous function φ : R → R̄ strictly convex at 1 with φ(1) = 0 and 1 ∈ relint dom φ, the f-divergence Dφ has the equivalent variational representation

$$D_\varphi(\mu \| \nu) = \sup_{f \in \mathrm{Lip}(X)} \left\{ \int f\, d\mu - D_\varphi^*(f \| \nu) \right\} = \sup_{f \in \mathrm{Lip}(X, x_0)} \left\{ \int f\, d\mu - D_\varphi^*(f \| \nu) \right\}, \quad (8)$$

with the tight convex conjugate D*φ(·‖ν) : Lip(X) → R being

$$D_\varphi^*(f \| \nu) = \sup_{\mu \in P(X)} \left\{ \int f\, d\mu - D_\varphi(\mu \| \nu) \right\} = \min_{\sup f(X) - \varphi'(\infty) \le \gamma} \left\{ \int \varphi_+^* \circ (f - \gamma)\, d\nu + \gamma \right\}. \quad (9)$$

The conjugate D*φ(·‖ν) is a topical function (Mohebi, 2005), meaning that D*φ(f + C‖ν) = D*φ(f‖ν) + C and D*φ(f1‖ν) ≤ D*φ(f2‖ν) both hold for C ∈ R and f1 ≤ f2. Based on the constant additivity property, the substitution D*φ(f‖ν) = ∫ f dν + D*φ(f − ∫ f dν‖ν) leads to

$$\sup_{f \in \mathrm{Lip}(X)} \left\{ \int f\, d\mu - \int f\, d\nu - D_\varphi^*\Big(f - \int f\, d\nu \,\Big\|\, \nu\Big) \right\},$$

reinterpreting the variational representation of Dφ(µ‖ν) as a penalized variant of maximum mean deviation. A closed form expression for D*φ(·‖ν) is available for the Kullback-Leibler divergence with D*_KL(f‖ν) = log ∫ e^f dν. We call functions f* for which Dφ(µ‖ν) = ∫ f* dµ − D*φ(f*‖ν) holds, i.e. f* ∈ ∂Dφ(µ‖ν), Csiszár potentials of µ, ν.
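For the Kullback-Leibler case, where the closed form of the tight conjugate is available, the following short NumPy sketch (illustrative only; the vectors are arbitrary) checks the two topical properties and the penalized mean deviation substitution on a discrete ν.

```python
import numpy as np

# Closed form of the tight conjugate (9) for the Kullback-Leibler divergence,
# assuming a discrete nu and critic values f given on its support.
def kl_tight_conjugate(f, nu):
    return np.log(np.sum(nu * np.exp(f)))

nu = np.array([0.4, 0.4, 0.2])
f = np.array([0.1, -0.3, 0.7])
C = 2.5

# Topical properties: constant additivity and monotonicity.
assert np.isclose(kl_tight_conjugate(f + C, nu), kl_tight_conjugate(f, nu) + C)
assert kl_tight_conjugate(f, nu) <= kl_tight_conjugate(f + 0.1, nu)

# Penalized mean deviation form: D*(f|nu) = <nu, f> + D*(f - <nu, f> | nu).
mean_f = np.sum(nu * f)
assert np.isclose(kl_tight_conjugate(f, nu), mean_f + kl_tight_conjugate(f - mean_f, nu))
```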
This is in analogy with Kantorovich potentials, which are similarly unique µ, ν-a.e. up to an additive constant. In the second variational representation in (8), the additive constant is fixed since f(x0) = 0 must hold. The following result is built on Borwein & Lewis (1993, Theorem 2.10).

Proposition 2. Given probability measures µ, ν ∈ P(X), a function f* ∈ Lip(X) is a Csiszár potential of µ, ν, i.e. Dφ(µ‖ν) = ∫ f* dµ − D*φ(f*‖ν), if and only if there exists C ∈ R such that the conditions

$$\sup f^*(X) + C \le \varphi'(\infty), \quad (10)$$

$$\frac{d\mu_c}{d\nu}(x) \in \partial \varphi_+^*(f^*(x) + C) \quad \nu\text{-a.e.,} \quad (11)$$

$$\mathrm{supp}(\mu_s) \subseteq \{x \in X : f^*(x) + C = \varphi'(\infty)\} \quad (12)$$

hold. Such f* are unique µ, ν-a.e. up to an additive constant.

If φ is of Legendre type (Borwein & Lewis, 1993), then φ+ and φ*_+ are both continuously differentiable on int dom φ+ and int dom φ*_+, respectively, while (φ*_+)' is increasing, and invertible where its value is positive with its inverse given by the strictly increasing φ'_+. With these, the second condition is equivalent to

$$f^*(x) + C = \varphi_+'\Big(\frac{d\mu_c}{d\nu}(x)\Big) \quad \mu_c\text{-a.e.} \quad (13)$$

Informally, this means that f* is the strictly increasing image of the likelihood ratio. One can then deduce from the Neyman-Pearson lemma (Reid & Williamson, 2011) that for the binary experiment of discriminating samples from µ and ν, the statistical test (x ↦ χ_[τ,∞](f*(x))) is a most powerful test for any threshold τ ∈ R.

Conversely to the above proposition, given ν ∈ P(X) and f ∈ Lip(X), the same conditions characterize the set of µ ∈ P(X) for which the supremum is achieved in the conjugate representation of D*φ(·‖ν), i.e. µ ∈ ∂D*φ(f‖ν). Denoting the optimal γ in (9) by γφ,ν(f), for any µ ∈ P(X) satisfying the conditions in Proposition 2 with C = −γφ,ν(f) one has µ ∈ ∂D*φ(f‖ν). For the Kullback-Leibler divergence, this reduces to the softmax µ = (1/∫ e^f dν) e^f ν. In case X is a finite set, this leads to a family of prediction functions obtained as gradients of D*φ(f‖ν) (Blondel et al., 2020).

We propose an algorithm for the practical evaluation of D*φ(·‖ν) when no closed form expression is available, in the case when the support of ν is finite² and φ is such that φ*_+ is twice differentiable on int dom φ*_+ with non-vanishing second derivative. Assuming that f achieves its maximum on the support of ν and that the γ achieving the minimum is unique, finding γ reduces to a finite dimensional problem, i.e. f, ν can be considered as elements of R^n with n being the number of elements of the support of ν. Based on Newton's method and the implicit function theorem, we propose Algorithm 1 to calculate γφ,ν(f) and its gradient³. Then, the conjugate can be calculated as

$$D_\varphi^*(f \| \nu) = \langle \nu, \varphi_+^*(f - \gamma_{\varphi,\nu}(f)) \rangle + \gamma_{\varphi,\nu}(f). \quad (14)$$

Algorithm 1 Calculate γφ,ν(f) and ∇f γφ,ν(f)
  Input: f, ν ∈ R^n, φ : R → R̄, 0 < ϵ, τ ∈ R
  if φ'(∞) < ∞ then γ = max(f) − φ'(∞) + ϵ else γ = ⟨ν, f⟩ end if
  repeat
    s = (1 − ⟨ν, (φ*_+)'(f − γ)⟩) / ⟨ν, (φ*_+)''(f − γ)⟩
    γ = γ − s
  until |s| < τ
  ∇f γ = (ν ⊙ (φ*_+)''(f − γ)) / ⟨ν, (φ*_+)''(f − γ)⟩

The derivation of the algorithm can be found in Appendix 8.3, along with the corresponding functions φ+, φ*_+ and their derivatives for the Kullback-Leibler, reverse Kullback-Leibler, χ², reverse χ², squared Hellinger, Jensen-Shannon, Jeffreys and triangular discrimination divergences. For the Kullback-Leibler divergence, one has the closed form γφ,ν(f) = log ∫ e^f dν. We found that exploiting the constant additivity property by calculating the conjugate as

$$D_\varphi^*(f \| \nu) = D_\varphi^*(f - \max(f) \| \nu) + \max(f) \quad (15)$$

is beneficial to avoid numerical instabilities. This can be seen as a generalization of the log-sum-exp trick.

²Such measures are dense in (P(X), W1).
³⟨·, ·⟩ and ⊙ denote the standard dot product and the elementwise product in R^n.
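Below is a minimal NumPy sketch of the forward pass of Algorithm 1 (not the paper's implementation). It assumes a discrete ν represented as a weight vector and takes the derivatives (φ*_+)' and (φ*_+)'' as caller-supplied callables `d_conj` and `dd_conj`; in an automatic differentiation framework the implicit gradient computed at the end would be wired into the backward pass. The sketch is checked against the Kullback-Leibler closed form γφ,ν(f) = log⟨ν, e^f⟩.

```python
import numpy as np

def gamma_newton(f, nu, d_conj, dd_conj, phi_prime_inf=np.inf, eps=1e-2, tau=1e-10):
    # Initialization depends on whether phi'(inf) is finite.
    if np.isfinite(phi_prime_inf):
        gamma = np.max(f) - phi_prime_inf + eps
    else:
        gamma = np.dot(nu, f)
    while True:
        # Newton step for the minimization over gamma in (9).
        s = (1.0 - np.dot(nu, d_conj(f - gamma))) / np.dot(nu, dd_conj(f - gamma))
        gamma -= s
        if abs(s) < tau:
            break
    # Gradient of gamma with respect to f via the implicit function theorem.
    grad = nu * dd_conj(f - gamma) / np.dot(nu, dd_conj(f - gamma))
    return gamma, grad

nu = np.array([0.4, 0.4, 0.2])
f = np.array([0.1, -0.3, 0.7])
# Kullback-Leibler case: (phi*_+)'(y) = (phi*_+)''(y) = exp(y), phi'(inf) = inf.
gamma, grad = gamma_newton(f, nu, d_conj=np.exp, dd_conj=np.exp)
assert np.isclose(gamma, np.log(np.dot(nu, np.exp(f))))       # closed form gamma
conjugate = np.dot(nu, np.exp(f - gamma) - 1.0) + gamma       # formula (14)
assert np.isclose(conjugate, np.log(np.dot(nu, np.exp(f))))   # D*_KL(f | nu)
```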
4. Moreau-Yosida approximation of f-divergences

Since the mapping (µ ↦ Dφ(µ‖ν)) from the metric space (P(X), W1) to R̄ is proper and lower semicontinuous, it is an ideal candidate for Moreau-Yosida approximation, for which the infimum is always achieved since (P(X), W1) is compact if (X, d) is. Given 0 < λ, α ∈ R, the Moreau-Yosida approximation of index λ and order α of Dφ(·‖ν) with respect to W1 is therefore defined as

$$D_{\varphi,\lambda,\alpha}(\mu \| \nu) = \min_{\xi \in P(X)} \left\{ D_\varphi(\xi \| \nu) + \lambda W_1(\mu, \xi)^\alpha \right\}. \quad (16)$$

This is still a divergence in the sense that Dφ,λ,α(µ‖ν) ≥ 0 with equality if and only if µ = ν. The original divergence can be recovered as Dφ(µ‖ν) = sup_{λ>0} Dφ,λ,α(µ‖ν) = lim_{λ→∞} Dφ,λ,α(µ‖ν) for any α > 0. Moreover, for α ≥ 1, Dφ,λ,α(·‖ν) is Lipschitz continuous with respect to W1. If α = 1, the Lipschitz constant is exactly λ. In some cases⁴, variational representations are available.

⁴Since the mapping (ξ ↦ λW1(µ, ξ)^α) is neither convex nor concave if 0 < α < 1, we could not obtain a variational representation via Fenchel-Rockafellar duality in this case.

Proposition 3. Given probability measures µ, ν ∈ P(X), λ > 0, α ≥ 1 and a proper, convex and lower semicontinuous function φ : R → R̄ strictly convex at 1 with φ(1) = 0 and 1 ∈ relint dom φ, the divergence Dφ,λ,α(µ‖ν) has the equivalent variational representation

$$\max_{f \in \mathrm{Lip}(X, x_0),\, \|f\|_L \le \lambda} \left\{ \int f\, d\mu - D_\varphi^*(f \| \nu) \right\} \quad (17)$$

if α = 1, and

$$\max_{f \in \mathrm{Lip}(X, x_0)} \left\{ \int f\, d\mu - D_\varphi^*(f \| \nu) - (\alpha - 1)\, \alpha^{-\frac{\alpha}{\alpha-1}}\, \lambda^{-\frac{1}{\alpha-1}}\, \|f\|_L^{\frac{\alpha}{\alpha-1}} \right\} \quad (18)$$

if α > 1.

In the limit α → 1, (18) converges to (17) in the sense that lim_{α→1} (α − 1) α^{−α/(α−1)} λ^{−1/(α−1)} ‖f‖_L^{α/(α−1)} = 0 if ‖f‖_L ≤ λ and ∞ otherwise, providing an unconstrained relaxation of the hard constraint ‖f‖_L ≤ λ. Choosing φ = i_{{1}} (so that Dφ(·‖ν) = i_{{ν}} and D*φ(f‖ν) = ∫ f dν), one has Dφ,λ,α(µ‖ν) = λW1(µ, ν)^α, leading to the following unconstrained variational representation of W1.

Proposition 4. Given µ, ν ∈ P(X), λ > 0 and α > 1, W1(µ, ν)^α has the equivalent unconstrained variational representation

$$W_1(\mu, \nu)^\alpha = \frac{1}{\lambda} \max_{f \in \mathrm{Lip}(X, x_0)} \left\{ \int f\, d\mu - \int f\, d\nu - (\alpha - 1)\, \alpha^{-\frac{\alpha}{\alpha-1}}\, \lambda^{-\frac{1}{\alpha-1}}\, \|f\|_L^{\frac{\alpha}{\alpha-1}} \right\}. \quad (19)$$

The maximum is achieved at αλW1(µ, ν)^(α−1) f*, with f* being a Kantorovich potential of µ, ν.

As stated, subgradients of the mapping (µ ↦ λW1(µ, ν)^α) are nothing but the Kantorovich potentials f* achieving the supremum in the Kantorovich-Rubinstein formula, scaled by the coefficient αλW1(µ, ν)^(α−1). This allows the characterization of subgradients of the mapping (µ ↦ Dφ,λ,α(µ‖ν)).

Proposition 5. Given probability measures µ, ν ∈ P(X), λ > 0, α ≥ 1 and a proper, convex and lower semicontinuous function φ : R → R̄ strictly convex at 1 with φ(1) = 0 and 1 ∈ relint dom φ, let ξ* ∈ P(X) be a probability measure achieving the minimum in (16), i.e. for which Dφ,λ,α(µ‖ν) = Dφ(ξ*‖ν) + λW1(µ, ξ*)^α holds. Then there exists an f* ∈ Lip(X) achieving the maximum in (17) if α = 1 or (18) if α > 1, which is a Csiszár potential of ξ*, ν and αλW1(µ, ξ*)^(α−1) times a Kantorovich potential of µ, ξ* at the same time.

These imply that for any τ ∈ R, the mapping (x ↦ χ_[τ,∞](f*(x))) is a most powerful test for discriminating samples from ξ* and ν, and that ‖f*‖_L = αλW1(µ, ξ*)^(α−1). Informally, since ξ* is close to µ in W1, the above mapping can be seen as a Lipschitz regularized version of a most powerful test for discriminating µ and ν.

Consider the reparametrization λ = (1/α)β^(−α), so that (18) reduces to

$$D_{\varphi, \frac{1}{\alpha}\beta^{-\alpha}, \alpha}(\mu \| \nu) = \max_{f \in \mathrm{Lip}(X, x_0)} \left\{ \int f\, d\mu - D_\varphi^*(f \| \nu) - \frac{\alpha - 1}{\alpha} \big(\beta \|f\|_L\big)^{\frac{\alpha}{\alpha-1}} \right\}. \quad (20)$$

A plot of the respective values of the multiplier and the exponent of ‖f‖_L for β = 1 and α ∈ [1, 16] is visualized in Figure 1.

Figure 1. Multiplier and exponent of ‖f‖_L.
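The following small sketch (illustrative only) evaluates the Lipschitz penalty appearing in (18), checks that the reparametrization λ = (1/α)β^(−α) turns it into the form in (20), and shows the α → 1 behavior that recovers the hard constraint.

```python
import numpy as np

def penalty_18(lip_norm, lam, alpha):
    # Penalty of (18): (alpha-1) * alpha^(-alpha/(alpha-1)) * lambda^(-1/(alpha-1)) * ||f||_L^(alpha/(alpha-1)).
    return (alpha - 1.0) * alpha ** (-alpha / (alpha - 1.0)) \
        * lam ** (-1.0 / (alpha - 1.0)) * lip_norm ** (alpha / (alpha - 1.0))

def penalty_20(lip_norm, beta, alpha):
    # Penalty of (20): ((alpha-1)/alpha) * (beta * ||f||_L)^(alpha/(alpha-1)).
    return (alpha - 1.0) / alpha * (beta * lip_norm) ** (alpha / (alpha - 1.0))

beta = 1.0
for alpha in (1.05, 2.0, 16.0):
    lam = beta ** (-alpha) / alpha            # reparametrization lambda = (1/alpha) * beta^(-alpha)
    assert np.isclose(penalty_18(2.0, lam, alpha), penalty_20(2.0, beta, alpha))

# As alpha -> 1 the penalty approaches the hard constraint ||f||_L <= lambda:
print(penalty_18(0.9, 1.0, 1.01))   # ~1e-7, vanishes below the threshold
print(penalty_18(1.1, 1.0, 1.01))   # ~56, and grows without bound as alpha -> 1
```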
In the limit α → ∞, the multiplier and exponent both converge to 1. On the other hand, one has lim_{α→1} ((α − 1)/α) ‖f‖_L^{α/(α−1)} = 0 if ‖f‖_L ≤ 1 and ∞ otherwise. An interesting special case is the limit α → ∞, resulting in the minimum of Dφ(ξ‖ν) with ξ ∈ P(X) ranging over the Wasserstein-1 ball of radius β centered at µ as

$$\lim_{\alpha \to \infty} D_{\varphi, \frac{1}{\alpha}\beta^{-\alpha}, \alpha}(\mu \| \nu) = \min_{\xi \in P(X),\, W_1(\xi, \mu) \le \beta} \{ D_\varphi(\xi \| \nu) \} = \max_{f \in \mathrm{Lip}(X, x_0)} \left\{ \int f\, d\mu - D_\varphi^*(f \| \nu) - \beta \|f\|_L \right\}. \quad (21)$$

This should be contrasted with (17) corresponding to the α = 1 case, which also has a hard constraint, but in the dual formula. Since the values of the above formulas for a given f are invariant under constant translations f + C, the supremums can equivalently be taken over Lip(X) instead of Lip(X, x0) in all cases.

5. Moreau-Yosida f-GAN

We propose the Moreau-Yosida f-GAN (MYf-GAN) as an implementation of the variational formula of the Moreau-Yosida regularization of Dφ with respect to W1. The function f in (17) or (18) is parametrized by a neural network called the critic, which is trained to maximize the formula inside the maximum, providing an approximation of the exact value of the divergence. One of the measures µ, ν is represented by the dataset, and the other by a neural network called the generator. The generator transforms samples from a fixed noise distribution into ones resembling the data distribution, and is trained to minimize the divergence approximated by the critic. Based on the reparametrized formula (20) with the substitution D*φ(f‖ν) = ∫ f dν + D*φ(f − ∫ f dν‖ν), the two minimax games are the following. First let µ be the generated distribution and ν be the data, resulting in the forward (→) formulation

$$\min_{\theta_g \in \mathbb{R}^l} \max_{\theta_f \in \mathbb{R}^k} \mathbb{E}_{(\zeta_n, \nu_n) \sim (P_z, P_d)} \Big[ \langle g_{\theta_g} \# \zeta_n, f_{\theta_f} \rangle - \langle \nu_n, f_{\theta_f} \rangle - D_\varphi^*\big(f_{\theta_f} - \langle \nu_n, f_{\theta_f} \rangle \,\big\|\, \nu_n\big) - \frac{\alpha - 1}{\alpha} \big(\beta\, \|f_{\theta_f}\|_{L, \hat{g}_{\theta_g} \# \zeta_n, \nu_n}\big)^{\frac{\alpha}{\alpha-1}} \Big]. \quad (22)$$

Now let µ be the data and ν the generated distribution, leading to the reverse (←) formulation

$$\min_{\theta_g \in \mathbb{R}^l} \max_{\theta_f \in \mathbb{R}^k} \mathbb{E}_{(\mu_n, \zeta_n) \sim (P_d, P_z)} \Big[ \langle \mu_n, f_{\theta_f} \rangle - \langle g_{\theta_g} \# \zeta_n, f_{\theta_f} \rangle - D_\varphi^*\big(f_{\theta_f} - \langle \hat{g}_{\theta_g} \# \zeta_n, f_{\theta_f} \rangle \,\big\|\, \hat{g}_{\theta_g} \# \zeta_n\big) - \frac{\alpha - 1}{\alpha} \big(\beta\, \|f_{\theta_f}\|_{L, \mu_n, \hat{g}_{\theta_g} \# \zeta_n}\big)^{\frac{\alpha}{\alpha-1}} \Big]. \quad (23)$$

The notation of the minimax games is the following. The functions f : X × R^k → R and g : Z × R^l → X are the critic and generator neural networks parametrized by weight vectors θf ∈ R^k and θg ∈ R^l, and fθf, gθg are shorthands for f(·, θf), g(·, θg). The latent space is Z = R^m. The sample space X ⊂ R^n is a compact subset of Euclidean space equipped with the restriction of the metric induced by the Euclidean norm, e.g. X = [−1, 1]^(3×32×32) for CIFAR-10. Pd ∈ P(X) denotes the data distribution and Pz ∈ P(Z) the noise distribution, e.g. a standard normal. Empirical measures (corresponding to minibatches) are denoted µn ∼ P, meaning that µn = (1/n) Σ_{i=1}^n δ_{x_{µ,i}} with (x_{µ,i}) ⊂ X being a realization of a sequence of n independent and identical copies of the random variable corresponding to P. The empirical measure corresponding to the generated distribution is obtained as the pushforward gθg#ζn of the latent empirical measure ζn (a minibatch of noise samples) through the generator gθg. Empirical means are denoted ⟨µn, f⟩ = (1/n) Σ_{i=1}^n f(x_{µ,i}). The conjugate D*φ is calculated according to (14) using the stabilization trick (15). By θ̂g we denote a copy of θg, meaning that θg is not optimized to minimize terms containing the copy, i.e. the loss function of the generator is ⟨gθg#ζn, fθf⟩. The term ‖fθf‖_{L,µn,νn} denotes a possibly data-dependent estimate of ‖fθf‖_L. The minimax games include the case lim_{α→∞} (α − 1)/α = lim_{α→∞} α/(α − 1) = 1.
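To make the structure of the forward game (22) concrete, here is a hedged TensorFlow sketch of the critic loss for one minibatch (assuming α > 1). The callables `tight_conjugate` (implementing (14)-(15), e.g. via Algorithm 1, with uniform weights over the minibatch) and `lipschitz_estimate` (implementing the estimator discussed next) as well as the `critic` and `generator` models are assumptions for illustration, not the paper's code.

```python
import tensorflow as tf

def forward_critic_loss(critic, generator, z_batch, data_batch,
                        tight_conjugate, lipschitz_estimate, alpha, beta):
    fake = generator(z_batch)                   # minibatch of g_theta_g # zeta_n
    f_fake = critic(fake)                       # critic values on generated samples
    f_data = critic(data_batch)                 # critic values on data samples
    mean_fake = tf.reduce_mean(f_fake)          # <g # zeta_n, f>
    mean_data = tf.reduce_mean(f_data)          # <nu_n, f>
    # Penalized mean deviation form: conjugate of the centered critic values on the data.
    conj = tight_conjugate(f_data - mean_data)
    # Lipschitz penalty; the generated samples are detached, mimicking the copy of theta_g in (22).
    lip = lipschitz_estimate(critic, tf.stop_gradient(fake), data_batch)
    penalty = (alpha - 1.0) / alpha * (beta * lip) ** (alpha / (alpha - 1.0))
    objective = mean_fake - mean_data - conj - penalty
    return -objective                           # the critic maximizes the objective
```

In the forward game the generator is then trained separately to minimize `mean_fake`, the only term of (22) carrying a gradient with respect to θg.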
Lipschitz norm estimation. Rademacher's theorem (Weaver, 2018) states that if ‖f‖_L < ∞ for f : R^n → R, then sup_{x∈R^n} ‖∇f(x)‖_2 = ‖f‖_L holds, i.e. the supremum of the function mapping x ∈ R^n to the Euclidean norm of the gradient of f at x is equal to the Lipschitz norm of f. Based on this and the gradient penalty of Gulrajani et al. (2017), we propose for ‖fθf‖_{L,µn,νn} the estimator

$$\mathbb{E}_{\upsilon_n \sim U[0,1)} \left[ \max_{x \in \mathrm{supp}(\upsilon_n \mu_n + (1 - \upsilon_n)\nu_n)} \|\nabla f_{\theta_f}(x)\|_2 \right], \quad (24)$$

giving a lower bound to ‖fθf‖_L. Here, U[0, 1) is the uniform distribution on [0, 1) from which an empirical measure υn = (1/n) Σ_{i=1}^n δ_{u_i} is drawn, and υnµn + (1 − υn)νn = (1/n) Σ_{i=1}^n δ_{u_i x_{µ,i} + (1−u_i) x_{ν,i}} denotes the corresponding interpolation of µn and νn. This clearly biased estimator leaves room for improvement. Constructing an unbiased estimator would require assuming a distribution for the random variable representing the value of the gradient norm of the critic, which we leave for future work.

Relaxation of hard Lipschitz constraint. We implement the hard constraint case α = 1 by replacing the last term in the minimax games with the one-sided gradient penalty (Gulrajani et al., 2017; Petzka et al., 2018)

$$\ell\, \mathbb{E}_{\upsilon_n \sim U[0,1)} \left[ \big\langle \upsilon_n \mu_n + (1 - \upsilon_n)\nu_n,\, \big(\max\{0, \|\nabla f_{\theta_f}(\cdot)\|_2 - \beta^{-1}\}\big)^2 \big\rangle \right],$$

with the coefficient 0 < ℓ ∈ R controlling the strength of the penalty. This is a widely used method to enforce the hard constraint ‖fθf‖_L ≤ β^(−1). We visualize the maximum, mean and minimum of minibatches of gradient norms of the critic during training in Figure 2 for α = 1 with the gradient penalty and α = 1.05 with the estimator detailed above. The α = 1 case does not enforce the hard constraint, since only the mean of the gradient norms is concentrated around β^(−1) = 1, and not their maximum. The α = 1.05 case, being a relaxation of the hard constraint, empirically behaves very similarly to an ideal hard constraint implementation, in the sense that the maximum of the gradient norms is concentrated around β^(−1) = 1. This is no surprise in light of Proposition 5, since ‖f*‖_L = αλW1(µ, ξ*)^(α−1) = β^(−α)W1(µ, ξ*)^(α−1) = β^(−(1+ϵ))W1(µ, ξ*)^ϵ is very close to β^(−1) in practice for small ϵ, such as ϵ = 0.05. We did not observe significant performance differences. This particular experiment used ℓ = 10 and φ corresponding to the Kullback-Leibler divergence, but we observed identical behavior in other hyperparameter settings as well with a range of α close to 1. We argue that using the relaxation with some α = 1 + ϵ is potentially beneficial for other applications requiring the satisfaction of a hard Lipschitz constraint.

Figure 2. ‖∇fθf(X)‖2 with relaxed Lipschitz constraint.
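A hedged TensorFlow sketch of the data-dependent Lipschitz estimate (24) and of the one-sided gradient penalty used in the α = 1 case (assuming 4-dimensional image batches of equal size; `critic` is an assumed Keras-style model, not the paper's code).

```python
import tensorflow as tf

def interpolated_grad_norms(critic, x_fake, x_real):
    n = tf.shape(x_real)[0]
    u = tf.random.uniform([n, 1, 1, 1])              # one u_i per sample
    x_hat = u * x_fake + (1.0 - u) * x_real          # interpolation of the two minibatches
    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        f_hat = critic(x_hat)
    grads = tape.gradient(f_hat, x_hat)
    return tf.norm(tf.reshape(grads, [n, -1]), axis=1)   # per-sample ||grad f(x)||_2

def lipschitz_estimate(critic, x_fake, x_real):
    # Estimator (24): maximum gradient norm over the interpolated minibatch.
    return tf.reduce_max(interpolated_grad_norms(critic, x_fake, x_real))

def one_sided_gradient_penalty(critic, x_fake, x_real, beta, coeff):
    # Penalizes gradient norms exceeding the bound beta^(-1), as in the alpha = 1 case.
    norms = interpolated_grad_norms(critic, x_fake, x_real)
    return coeff * tf.reduce_mean(tf.square(tf.maximum(0.0, norms - 1.0 / beta)))
```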
Choice of f-divergence. Quantitative results in terms of Inception Score (IS) and Fréchet Inception Distance (FID) can be seen in Table 1.

Table 1. MYf-GAN performance on CIFAR-10 (IS / FID). For α = 1.05 and α = 2 the forward (→) and reverse (←) formulations are reported; for the Trivial case the two formulations coincide; "–" denotes a setting that could not be evaluated.

| Dφ | β = 0 | α = 1.05, β = 1 (→) | α = 1.05, β = 1 (←) | α = 2, β = 1 (→) | α = 2, β = 1 (←) | α = ∞, β = 0.5→0.2 |
|---|---|---|---|---|---|---|
| Kullback-Leibler | 7.16 / 34.12 | 8.26 / 13.22 | 8.33 / 14.83 | 8.20 / 13.85 | 8.09 / 13.42 | 8.20 / 12.51 |
| Reverse Kullback-Leibler | – | 8.33 / 12.97 | 8.30 / 13.27 | 8.34 / 13.24 | 8.17 / 13.13 | 8.09 / 15.26 |
| χ² | – | 8.18 / 14.17 | 8.26 / 13.36 | 8.37 / 13.36 | 8.23 / 12.95 | 8.27 / 13.46 |
| Reverse χ² | – | 8.47 / 13.89 | 8.26 / 14.59 | 8.24 / 14.04 | 8.45 / 12.28 | 8.11 / 14.17 |
| Squared Hellinger | – | 8.03 / 16.41 | 8.07 / 16.06 | 8.25 / 15.89 | 8.25 / 13.93 | 8.52 / 12.18 |
| Jensen-Shannon | 7.51 / 30.17 | 8.30 / 14.49 | 8.34 / 12.71 | 8.04 / 16.04 | 8.37 / 11.57 | 8.27 / 12.58 |
| Jeffreys | – | 8.09 / 13.99 | 8.21 / 14.46 | 8.25 / 13.32 | 8.34 / 13.04 | – |
| Triangular discrimination | 6.45 / 43.14 | 8.42 / 13.54 | 8.08 / 14.68 | 8.15 / 14.28 | 8.35 / 12.21 | 8.09 / 15.13 |
| Total variation | 7.41 / 31.09 | 8.12 / 15.44 | 8.28 / 14.61 | 8.08 / 13.53 | 8.12 / 13.77 | 8.05 / 14.60 |
| Trivial | – | 8.07 / 15.97 | | 8.04 / 14.75 | | 6.67 / 36.48 |

Missing values in the unregularized case (β = 0) indicate divergent training, showing that regularization (β > 0) not only improves performance, but leads to convergent training even in cases when it does not seem possible without regularization. The TRIVIAL case indicates Dφ(·‖ν) = i_{{ν}}, so that the forward and reverse formulations are identical. In this case, Dφ,(1/α)β^(−α),α(µ‖ν) reduces to (1/α)β^(−α)W1(µ, ν)^α. If α > 1, this leads to an unconstrained formulation of the Wasserstein GAN corresponding to Proposition 4. The original, constrained Wasserstein GAN with gradient penalty led to an IS of 8.09 and an FID of 13.40 in our implementation. This is marginally better than the performance of the unconstrained variant as reported in Table 1. As shown in Figure 2, the gradient penalty leads to a higher gradient norm than required by the hard constraint, which might explain the observed marginal performance improvement. Indeed, increasing β leads to better performance for the unconstrained variant, e.g. β = 0.5 with α = 2 led to an IS of 8.14 and an FID of 13.33, which is in turn marginally better than the original, constrained variant. While it is hard to tell from these results which f-divergence is the best, it is definitely not the TRIVIAL one.

Figure 3. f(X) for the default and quotient critic.

Quotient critic. To ensure that fθf ∈ Lip(X, x0), we simply modify the forward pass of the critic to return fθf(x) − fθf(x0) instead of fθf(x) (see the sketch below). This induces negligible computational overhead since fθf(x0) can be calculated with a minibatch of size 1, with the choice of x0 being arbitrary, e.g. the zero vector in our implementation. We call this the quotient critic since Lip(X, x0) is isomorphic to the quotient space Lip(X)/R. In Figure 3 we visualize the maximum, mean and minimum of the critic output over minibatches of generated samples during training. It is clear that the quotient critic solves the drifting of the output of the critic, which was found to hurt performance in some cases (Karras et al., 2018; Adler & Lunz, 2018). We observed only marginal performance improvement.

Loss function of the generator. The reason for picking the penalized mean deviation form of the variational formulas for this application is that in the reverse case, we found that using ⟨gθg#ζn, fθf⟩ as the loss function of the generator leads to superior performance compared to using D*φ(fθf‖gθg#ζn), which cripples performance in most cases. This suggests that gradients of the Csiszár potential f* might be of greater interest than the gradient of the conjugate D*φ(f‖ν). The latter is a reweighting of the former, since the gradient of the conjugate is a probability distribution, such as the softmax for the Kullback-Leibler divergence.

Optimal critic has bounded Lipschitz constant. Notice that while the variational formula of f-divergences contains a supremum, the formula of their Moreau-Yosida approximations contains a maximum. This means that in the former case, even if the divergence is finite, the supremum might not be achieved by a Lipschitz function. The variational representation (8) only implies that a sequence of Lipschitz functions converges to a function achieving the supremum, but the limit is not necessarily Lipschitz continuous, in fact it might not even be continuous. On the other hand, for the Moreau-Yosida approximation, the maximum in (17) or (18) is always achieved by a Lipschitz function. Since any neural network is Lipschitz continuous, we argue that a trained critic can provide a better estimate of the Moreau-Yosida approximation, since its target f* is not only a Csiszár potential of ξ*, ν but a scaled Kantorovich potential of µ, ξ* as well, implying that it has a bounded Lipschitz constant.
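As a concrete illustration of the quotient critic described above, here is a minimal TensorFlow wrapper (one possible implementation, not the paper's code) that shifts the critic output so that it vanishes at the base point x0.

```python
import tensorflow as tf

class QuotientCritic(tf.keras.Model):
    # Wraps an arbitrary critic so that the represented function lies in Lip(X, x0).
    def __init__(self, critic, x0):
        super().__init__()
        self.critic = critic
        self.x0 = x0          # base point, e.g. a single zero sample of the input shape

    def call(self, x):
        return self.critic(x) - self.critic(self.x0)   # f(x) - f(x0), so f(x0) = 0
```

Wrapping the critic this way leaves the rest of the training loop unchanged, since only the forward pass is modified.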
Figure 4. ‖fθf‖_{L,µn,νn} during training (α = 1.05, β = 1; α = 2, β = 1; α = ∞, β = 1).

The α = 2 and α = ∞ cases. Since f* is a Kantorovich potential scaled by the coefficient β^(−α)W1(µ, ξ*)^(α−1) and the Lipschitz norm of a Kantorovich potential is 1, the case α > 1 can be seen as adaptive Lipschitz regularization, with ‖f*‖_L decaying during training as µ and ν drift closer and W1(µ, ξ*) becomes smaller. We visualized ‖fθf‖_{L,µn,νn} in Figure 4 during training with α = 1.05, 2, ∞ and β = 1. Ideally, the Lipschitz norm of the critic would vanish. This can be observed in the α = ∞ case, which leads to finding a generated distribution with Wasserstein-1 distance β = 1 from the data distribution, in accordance with (21). The best FID in Table 1 indicates that it can be beneficial to choose α = 2 even though the Lipschitz norm does not vanish. While the case α = ∞ (where we only consider one of the two formulations) leads to low performance with high values of β and unstable training with low values of β, we found that decaying β e.g. from 0.5 to 0.2 led to the best IS, as can be seen in Table 1⁵. The TRIVIAL case does not perform well in this setting, which is not surprising since the exact value of Dφ,(1/α)β^(−α),α(µ‖ν) in the α → ∞ limit is ∞ if ν is not contained in the W1 ball of radius β centered at µ, and 0 otherwise.

⁵Numerical instabilities prevented us from evaluating the Jeffreys divergence in this setting.

Preliminary experiments showed that other values of α behave similarly to the ones we considered, which is why we restricted our attention to the representative values 1.05, 2 and ∞. The implementation was done in TensorFlow, using the residual critic and generator architectures from Gulrajani et al. (2017). Training was done for 100000 iterations, with 5 gradient descent steps per iteration for the critic, and 1 for the generator. Additional results, details of the experimental setup and generated images can be found in Appendix 8.4, along with toy examples validating our approach for approximating f-divergences through the tight variational representations on categorical and Gaussian distributions. The original f-GAN losses (Nowozin et al., 2016) were particularly unstable in our implementation. Training the critic for 1 instead of 5 steps per iteration led to more stability, but even in this case only the χ² divergence made it to 100000 iterations without numerical errors, leading to an IS of 6.49 and an FID of 40.64. Source code to reproduce the experiments is available at https://github.com/renyi-ai/moreau-yosida-f-divergences.

6. Related work

In Farnia & Tse (2018), Dφ,1,1 is defined, and a non-tight variational representation is given for symmetric choices of Dφ. They also prove that Dφ,1,1 between the data and generated distributions is a continuous function of the generator parameters, and provide a dual formula for the case α = 2 using W2 instead of W1. A future direction is to prove analogous results for general α, λ and Wp. In Birrell et al. (2020), a generalization of Dφ,1,1 is defined with arbitrary IPMs instead of W1, but their assumptions on φ are more restrictive, and they explicitly define Dφ(µ‖ν) to be ∞ if µ ≪ ν does not hold.
In Husain et al. (2019), the Lipschitz constrained version of the non-tight variational representation of Dφ is shown to be a lower bound to the Wasserstein autoencoder objective. In Laschos et al. (2019), it is proved that the supremum in the Donsker-Varadhan formula can equivalently be taken over Lipschitz continuous functions. In Song & Ermon (2020), based on the non-tight representation, another generalization of f-GANs and WGAN is proposed, with the importance weights r analogous to the gradient of D*φ(f‖ν) in our case. Connections to density ratio estimation and sample reweighting are discussed, which apply to our case as well. In Arbel et al. (2021), the Lipschitz constrained version of the Donsker-Varadhan formula is proposed as an objective function for energy-based models. For representation learning by mutual information maximization, Ozair et al. (2019) proposes the Lipschitz constrained version of the Donsker-Varadhan formula as a proxy for mutual information, which is shown to be empirically superior to the unconstrained formulation. In Zhou et al. (2019), it is shown that Lipschitz regularization improves the performance of GANs in general, not only that of the Wasserstein GAN. The uniqueness of the optimal critic is investigated, and formulas are proposed for which uniqueness holds. We solve the uniqueness problem in another way, by implementing the quotient critic.

To summarize, the recognition of the primal formula being the Moreau-Yosida regularization of Dφ with respect to W1 and the case α ≠ 1 are novel to our paper. This includes the unconstrained variational formula for W1. Regarding f-divergences, the tight variational representation over the quotient space Lip(X, x0) and the characterization of Csiszár potentials are new as well. Additionally, we allow the same generality in terms of the choice of φ as Agrawal & Horel (2020). On the practical side, we proposed an algorithm to calculate the tight conjugate D*φ(f‖ν) and its gradient. Experimentally, implementations are provided for GANs based on the tight variational representation not only of the Kullback-Leibler divergence, but the reverse Kullback-Leibler, χ², reverse χ², squared Hellinger, Jensen-Shannon, Jeffreys, triangular discrimination and total variation divergences as well.

7. Conclusions

In this paper, we studied the Moreau-Yosida regularization of f-divergences with respect to the Wasserstein-1 metric in a convex duality framework. We presented variational formulas and characterizations of optimal variables, generalizing a number of existing results and leading to novel special cases of interest, and proposed the MYf-GAN as an implementation of the formulas. Future directions include finding the variational formulas for Moreau-Yosida approximation with respect to all Wasserstein-p metrics including the case 0 < α < 1, improving the estimation of the Lipschitz norm of the critic, making use of the fact that Csiszár-Kantorovich potentials can be seen as Lipschitz-regularized statistical tests, e.g. for sample reweighting, and scaling up to higher-dimensional datasets. Additionally, the results can potentially be applied to learning algorithms other than GANs, such as representation learning by mutual information maximization, energy-based models, generalized prediction functions and density ratio estimation.
Acknowledgements

The author was supported by the Hungarian National Excellence Grant 2018-1.2.1-NKP-00008 and by the Hungarian Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program, and would like to thank the Artificial Intelligence Research Group at the Alfréd Rényi Institute of Mathematics, especially Dániel Varga and Diego González Sánchez, for their generous help.

References

Adler, J. and Lunz, S. Banach Wasserstein GAN. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 6755-6764, 2018.

Agrawal, R. and Horel, T. Optimal bounds between f-divergences and integral probability metrics. CoRR, abs/2006.05973, 2020.

Ali, S. M. and Silvey, S. D. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B, 28(1):131-142, 1966.

Arbel, M., Zhou, L., and Gretton, A. Generalized energy based models. In International Conference on Learning Representations, 2021.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp. 214-223. PMLR, 2017.

Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Hjelm, R. D., and Courville, A. C. Mutual information neural estimation. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 530-539. PMLR, 2018.

Birrell, J., Dupuis, P., Katsoulakis, M. A., Pantazis, Y., and Rey-Bellet, L. (f, Γ)-divergences: Interpolating between f-divergences and integral probability metrics. CoRR, abs/2011.05953, 2020. URL https://arxiv.org/abs/2011.05953.

Blondel, M., Martins, A. F. T., and Niculae, V. Learning with Fenchel-Young losses. J. Mach. Learn. Res., 21:35:1-35:69, 2020.

Borwein, J. M. and Lewis, A. S. Partially-finite programming in L1 and the existence of maximum entropy estimates. SIAM J. Optim., 3(2):248-267, 1993. doi: 10.1137/0803012.

Boyd, S. P. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2014. ISBN 978-0-521-83378-3. doi: 10.1017/CBO9780511804441.

Cobzaş, Ş., Miculescu, R., and Nicolae, A. Lipschitz Functions. Lecture Notes in Mathematics. Springer International Publishing, 2019. ISBN 9783030164881.

Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. A Magyar Tudományos Akadémia Matematikai Kutató Intézetének Közleményei, 8(1-2):85-108, 1963.

Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2:299-318, 1967.

Csiszár, I., Gamboa, F., and Gassiat, E. MEM pixel correlated solutions for generalized moment and interpolation problems. IEEE Trans. Inf. Theory, 45(7):2253-2270, 1999. doi: 10.1109/18.796367.

Dal Maso, G. An Introduction to Γ-Convergence. Birkhäuser Boston, Boston, MA, 1993. ISBN 978-1-4612-0327-8. doi: 10.1007/978-1-4612-0327-8.
Donsker, M. D. and Varadhan, S. R. S. Asymptotic evaluation of certain Markov process expectations for large time III. Communications on Pure and Applied Mathematics, 29(4):389-461, 1976. doi: 10.1002/cpa.3160290405.

Farnia, F. and Tse, D. A convex duality framework for GANs. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 5254-5263, 2018.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 5767-5777, 2017.

Hanin, L. G. An extension of the Kantorovich norm. Contemporary Mathematics, 226:113-130, 1999.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.

Husain, H., Nock, R., and Williamson, R. C. A primal-dual link between GANs and autoencoders. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 413-422, 2019.

Jost, J. and Li-Jost, X. Calculus of Variations. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2008. ISBN 9780521057127.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.

Laschos, V., Obermayer, K., Shen, Y., and Stannat, W. A Fenchel-Moreau-Rockafellar type theorem on the Kantorovich-Wasserstein space with applications in partially observable Markov decision processes. Journal of Mathematical Analysis and Applications, 477(2):1133-1156, 2019. ISSN 0022-247X. doi: 10.1016/j.jmaa.2019.05.004.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.

Mohebi, H. Topical Functions and their Properties in a Class of Ordered Banach Spaces, pp. 343-361. Springer US, Boston, MA, 2005. ISBN 978-0-387-26771-5. doi: 10.1007/0-387-26771-9_12.

Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 271-279, 2016.
Ozair, S., Lynch, C., Bengio, Y., van den Oord, A., Levine, S., and Sermanet, P. Wasserstein dependency measure for representation learning. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 15578-15588, 2019.

Petzka, H., Fischer, A., and Lukovnikov, D. On the regularization of Wasserstein GANs. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.

Reid, M. D. and Williamson, R. C. Information, divergence and risk for binary experiments. J. Mach. Learn. Res., 12:731-817, 2011.

Ruderman, A., Reid, M. D., García-García, D., and Petterson, J. Tighter variational representations of f-divergences via restriction to probability measures. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. icml.cc / Omnipress, 2012.

Song, J. and Ermon, S. Bridging the gap between f-GANs and Wasserstein GANs. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 9078-9087. PMLR, 2020.

Tao, T. Several Variable Differential Calculus, pp. 127-161. Springer Singapore, Singapore, 2016. ISBN 978-981-10-1804-6. doi: 10.1007/978-981-10-1804-6_6.

Terjék, D. Adversarial Lipschitz regularization. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.

Villani, C. Optimal Transport: Old and New, volume 338, pp. xxii+973. Springer, 2008. doi: 10.1007/978-3-540-71050-9.

Weaver, N. Lipschitz Algebras. World Scientific, 2nd edition, 2018. doi: 10.1142/9911.

Zalinescu, C. Convex Analysis in General Vector Spaces. World Scientific, 2002. ISBN 9789812380678.

Zhou, Z., Liang, J., Song, Y., Yu, L., Wang, H., Zhang, W., Yu, Y., and Zhang, Z. Lipschitz generative adversarial nets. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 7584-7593. PMLR, 2019.