# Smoothness and Stability in GANs

Published as a conference paper at ICLR 2020

Casey Chu (Stanford University, caseychu@stanford.edu), Kentaro Minami (Preferred Networks, Inc., minami@preferred.jp), Kenji Fukumizu (The Institute of Statistical Mathematics / Preferred Networks, Inc., fukumizu@ism.ac.jp)

**Abstract.** Generative adversarial networks, or GANs, commonly display unstable behavior during training. In this work, we develop a principled theoretical framework for understanding the stability of various types of GANs. In particular, we derive conditions that guarantee eventual stationarity of the generator when it is trained with gradient descent, conditions that must be satisfied by the divergence minimized by the GAN and by the generator's architecture. We find that existing GAN variants satisfy some, but not all, of these conditions. Using tools from convex analysis, optimal transport, and reproducing kernels, we construct a GAN that fulfills these conditions simultaneously. In the process, we explain and clarify the need for various existing GAN stabilization techniques, including Lipschitz constraints, gradient penalties, and smooth activation functions.

## 1 INTRODUCTION: TAMING INSTABILITY WITH SMOOTHNESS

Generative adversarial networks (Goodfellow et al., 2014), or GANs, are a powerful class of generative models defined through a minimax game. GANs and their variants have shown impressive performance in synthesizing various types of datasets, especially natural images. Despite these successes, the training of GANs remains quite unstable in nature, and this instability remains difficult to understand theoretically.

Since the introduction of GANs, many techniques have been proposed to stabilize GAN training, including new generator/discriminator architectures, loss functions, and regularization techniques. Notably, Arjovsky et al. (2017) proposed the Wasserstein GAN (WGAN), which in principle avoids instability caused by mismatched generator and data distribution supports. In practice, this is enforced by Lipschitz constraints, which in turn motivated developments like gradient penalties (Gulrajani et al., 2017) and spectral normalization (Miyato et al., 2018). Indeed, these stabilization techniques have proven essential to achieving the latest state-of-the-art results (Karras et al., 2018; Brock et al., 2019).

On the other hand, a solid theoretical understanding of training stability has not been established. Several empirical observations point to an incomplete understanding. For example, why does applying a gradient penalty together with spectral normalization seem to improve performance (Miyato et al., 2018), even though in principle they serve the same purpose? Why does applying only spectral normalization with the Wasserstein loss fail (Miyato, 2018), even though the analysis of Arjovsky et al. (2017) suggests it should be sufficient? Why are gradient penalties effective even outside their original context of the Wasserstein GAN (Fedus et al., 2018)?

In this work, we develop a framework to analyze the stability of GAN training that resolves these apparent contradictions and clarifies the roles of these regularization techniques. Our approach considers the smoothness of the loss function used. In optimization, smoothness is a well-known condition that ensures that gradient descent and its variants become stable (see e.g., Bertsekas (1999)).
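To make this concrete before the formal statement, the following minimal numerical sketch (ours, not an experiment from the paper) runs gradient descent with constant step size $1/L$ on an $L$-smooth function and checks that the gradient norm tends to zero; the quadratic objective is an arbitrary illustrative choice.

```python
import numpy as np

# Minimal illustration: gradient descent with constant step size 1/L on an
# L-smooth function drives the gradient norm toward zero (cf. Proposition 1 below).
# The quadratic objective here is a hypothetical example, not one from the paper.

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                    # symmetric positive definite

def grad_f(x):
    return A @ x                              # gradient of f(x) = 0.5 * x^T A x

L = np.linalg.eigvalsh(A).max()               # smoothness constant of f
x = np.array([10.0, -7.0])
for _ in range(200):
    x = x - (1.0 / L) * grad_f(x)             # the update of Proposition 1

print(np.linalg.norm(grad_f(x)))              # effectively 0: ||grad f(x_k)|| -> 0
```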
The following well-known proposition, which formalizes this behavior, is the starting point of our stability analysis:

**Proposition 1** (Bertsekas (1999), Proposition 1.2.3). Suppose $f : \mathbb{R}^p \to \mathbb{R}$ is $L$-smooth and bounded below. Let $x_{k+1} := x_k - \frac{1}{L}\nabla f(x_k)$. Then $\|\nabla f(x_k)\| \to 0$ as $k \to \infty$.

This proposition says that under a smoothness condition on the function, gradient descent with a constant step size $\frac{1}{L}$ approaches stationarity (i.e., the gradient norm approaches zero). This is a rather weak notion of convergence, as it does not guarantee that the iterates converge to a point, and even if the iterates do converge, the limit is a stationary point and not necessarily a minimizer. Nevertheless, empirically, not even this stationarity is satisfied by GANs, which are known to frequently destabilize and diverge during training.

To diagnose this instability, we consider the smoothness of the GAN's loss function. GANs are typically framed as minimax problems of the form
$$\inf_\theta \sup_\phi\; \mathcal{J}(\mu_\theta, \phi), \tag{1}$$
where $\mathcal{J}$ is a loss function that takes a generator distribution $\mu_\theta$ and a discriminator $\phi$, and $\theta \in \mathbb{R}^p$ denotes the parameters of the generator. Unfortunately, the minimax nature of this problem makes stability and convergence difficult to analyze. To make the analysis more tractable, we define $J(\mu) = \sup_\phi \mathcal{J}(\mu, \phi)$, so that (1) becomes simply
$$\inf_\theta J(\mu_\theta). \tag{2}$$
This choice corresponds to the common assumption that the discriminator is allowed to reach optimality at every training step. Now, the GAN algorithm can be regarded as simply gradient descent on the $\mathbb{R}^p \to \mathbb{R}$ function $\theta \mapsto J(\mu_\theta)$, which may be analyzed using Proposition 1. In particular, if the function $\theta \mapsto J(\mu_\theta)$ satisfies the smoothness assumption, then GAN training should be stable in the sense that it approaches stationarity under the assumption of an optimal discriminator.

In the remainder of this paper, we investigate whether the smoothness assumption is satisfied for various GAN losses. Our analysis answers two questions:

Q1. Which existing GAN losses, if any, satisfy the smoothness condition in Proposition 1?
Q2. Are there choices of loss, regularization, or architecture that enforce smoothness in GANs?

As results of our analysis, our contributions are as follows:

1. We derive sufficient conditions for the GAN algorithm to be stationary under certain assumptions (Theorem 1). Our conditions relate to the smoothness of the GAN loss used as well as to the parameterization of the generator.
2. We show that most common GAN losses do not satisfy all of the smoothness conditions, thereby corroborating their empirical instability.
3. We develop regularization techniques that enforce the smoothness conditions. These regularizers recover common GAN stabilization techniques such as gradient penalties and spectral normalization, thereby placing their use on a firmer theoretical foundation.
4. Our analysis provides several practical insights, suggesting for example the use of smooth activation functions, simultaneous spectral normalization and gradient penalties, and a particular learning rate for the generator.

### 1.1 RELATED WORK

**Divergence minimization.** Our analysis regards the GAN algorithm as minimizing a divergence between the current generator distribution and the desired data distribution, under the assumption of an optimal discriminator at every training step. This perspective originates from the earliest GAN paper, in which Goodfellow et al. (2014) show that the original minimax GAN implicitly minimizes the Jensen–Shannon divergence.
Since then, the community has introduced a large number of GAN or GAN-like variants that learn generative models by implicitly minimizing various divergences, including f-divergences (Nowozin et al., 2016), the Wasserstein distance (Arjovsky et al., 2017), and the maximum mean discrepancy (Li et al., 2015; Unterthiner et al., 2018). Meanwhile, the non-saturating GAN (Goodfellow et al., 2014) has been shown to minimize a certain Kullback–Leibler divergence (Arjovsky & Bottou, 2017). Several more theoretical works consider the topological, geometric, and convexity properties of divergence minimization (Arjovsky & Bottou, 2017; Liu et al., 2017; Bottou et al., 2018; Farnia & Tse, 2018; Chu et al., 2019), perspectives that we draw heavily upon. Sanjabi et al. (2018) also prove smoothness of GAN losses in the specific case of the regularized optimal transport loss. Their assumption for smoothness is entangled in that it involves a composite condition on generators and discriminators, while our analysis addresses them separately.

Table 1: Common GANs, their corresponding loss functions, and their optimal discriminators.

| | Loss function $J(\mu)$ | Optimal discriminator $\Phi_\mu(x)$ |
|---|---|---|
| Minimax GAN | $D_{JS}(\mu \,\Vert\, \mu_0)$ | $\frac{1}{2}\log\frac{\mu(x)}{\mu(x)+\mu_0(x)}$ |
| Non-saturating GAN | $D_{KL}(\frac{1}{2}\mu + \frac{1}{2}\mu_0 \,\Vert\, \mu_0)$ | $-\frac{1}{2}\log\frac{\mu_0(x)}{\mu(x)+\mu_0(x)}$ |
| Wasserstein GAN | $W_1(\mu, \mu_0)$ | $\arg\max_{\Vert f\Vert_{\mathrm{Lip}} \leq 1} \mathbb{E}_{y\sim\mu}[f(y)] - \mathbb{E}_{y\sim\mu_0}[f(y)]$ |
| GMMN, Coulomb GAN | $\frac{1}{2}\mathrm{MMD}^2(\mu, \mu_0)$ | $\mathbb{E}_{y\sim\mu}[K(x, y)] - \mathbb{E}_{y\sim\mu_0}[K(x, y)]$ |
| IPM-GAN | $\mathrm{IPM}_{\mathcal{F}}(\mu, \mu_0)$ | $\arg\max_{f\in\mathcal{F}} \mathbb{E}_{y\sim\mu}[f(y)] - \mathbb{E}_{y\sim\mu_0}[f(y)]$ |

**Other approaches.** Even though many analyses, including ours, operate under the assumption of an optimal discriminator, this assumption is unrealistic in practice. Li et al. (2017b) contrast these optimal discriminator dynamics with first-order dynamics, which assume that the generator and discriminator use alternating gradient updates and which are what is used computationally. As this is a differing approach from ours, we only briefly mention some results in this area, which typically rely on game-theoretic notions (Kodali et al., 2017; Grnarova et al., 2018; Oliehoek et al., 2018) or local analysis (Nagarajan & Kolter, 2017; Mescheder et al., 2018). Some of these results rely on continuous dynamics approximations of gradient updates; in contrast, our work focuses on discrete dynamics.

### 1.2 NOTATION

Let $\bar{\mathbb{R}} := \mathbb{R} \cup \{-\infty, \infty\}$. We let $\mathcal{P}(X)$ denote the set of all probability measures on a compact set $X \subseteq \mathbb{R}^d$. We let $\mathcal{M}(X)$ and $C(X)$ denote the dual pair consisting of the set of all finite signed measures on $X$ and the set of all continuous functions $X \to \mathbb{R}$. For any statement $A$, we let $\chi\{A\}$ be $0$ if $A$ is true and $\infty$ if $A$ is false. For a Euclidean vector $x$, its Euclidean norm is denoted by $\|x\|_2$, and the operator norm of a matrix $A$ is denoted by $\|A\|_2$, i.e., $\|A\|_2 = \sup_{\|x\|_2 \leq 1} \|Ax\|_2$. A function $f : X \to Y$ between two metric spaces is $\alpha$-Lipschitz if $d_Y(f(x_1), f(x_2)) \leq \alpha\, d_X(x_1, x_2)$. A function $f : \mathbb{R}^d \to \mathbb{R}$ is $\beta$-smooth if its gradients are $\beta$-Lipschitz, that is, for all $x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\|_2 \leq \beta\|x - y\|_2$.

## 2 SMOOTHNESS OF GAN LOSSES

This section presents Theorem 1, which provides concise criteria for the smoothness of GAN losses. In order to keep our analysis agnostic to the particular GAN used, let $J : \mathcal{P}(X) \to \bar{\mathbb{R}}$ be an arbitrary convex loss function, which takes a distribution over $X \subseteq \mathbb{R}^d$ and outputs a real number. Note that the typical minimax formulation of GANs can be recovered from just the loss function $J$ using convex duality.
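As a concrete instance of the optimal discriminators in Table 1, the witness function of the squared-MMD loss can be estimated by plain Monte Carlo (a point elaborated in Appendix B). The numpy sketch below is ours and purely illustrative: the toy Gaussian samples stand in for the generator and data distributions, and the kernel is the Gaussian kernel $K(x,y) = e^{-\pi\|x-y\|^2}$ used later in Section 6.

```python
import numpy as np

# Monte Carlo estimate of the optimal discriminator (witness function) for the
# squared-MMD loss in Table 1:
#   Phi_mu(x) = E_{y~mu}[K(x, y)] - E_{y~mu0}[K(x, y)],
# with the Gaussian kernel K(x, y) = exp(-pi * ||x - y||^2).
# Sample sizes and toy distributions are arbitrary illustrative choices.

def gaussian_kernel(x, ys):
    return np.exp(-np.pi * np.sum((x - ys) ** 2, axis=-1))

def mmd_witness(x, gen_samples, data_samples):
    return gaussian_kernel(x, gen_samples).mean() - gaussian_kernel(x, data_samples).mean()

rng = np.random.default_rng(0)
gen_samples = rng.normal(loc=2.0, size=(500, 2))    # samples from mu  (generator)
data_samples = rng.normal(loc=0.0, size=(500, 2))   # samples from mu0 (data)

print(mmd_witness(np.zeros(2), gen_samples, data_samples))   # < 0: the point looks like data
```

Unlike the other optimal discriminators in Table 1, no auxiliary optimization is needed here; this is the sense in which MMD-based GANs have a closed-form discriminator.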
To recover the minimax formulation, recall that the convex conjugate $J^* : C(X) \to \bar{\mathbb{R}}$ of $J$ satisfies the following remarkable duality, known as the Fenchel–Moreau theorem:
$$J^*(\phi) := \sup_{\mu \in \mathcal{M}(X)} \int \phi(x)\, d\mu - J(\mu), \qquad J(\mu) = \sup_{\phi \in C(X)} \int \phi(x)\, d\mu - J^*(\phi). \tag{3}$$
Based on this duality, minimizing $J$ can be framed as the minimax problem
$$\inf_{\mu \in \mathcal{P}(X)} J(\mu) = \inf_{\mu \in \mathcal{P}(X)} \sup_{\phi \in C(X)} \int \phi(x)\, d\mu - J^*(\phi) =: \inf_{\mu \in \mathcal{P}(X)} \sup_{\phi \in C(X)} \mathcal{J}(\mu, \phi), \tag{4}$$
recovering the well-known adversarial formulation of GANs. We now define the notion of an optimal discriminator for an arbitrary loss function $J$, based on this convex duality:

**Definition 1** (Optimal discriminator). Let $J : \mathcal{M}(X) \to \bar{\mathbb{R}}$ be a convex, l.s.c., proper function. An optimal discriminator for a probability distribution $\mu \in \mathcal{P}(X)$ is a continuous function $\Phi_\mu : X \to \mathbb{R}$ that attains the maximum of the second equation in (3), i.e., $J(\mu) = \int \Phi_\mu(x)\, d\mu - J^*(\Phi_\mu)$.

This definition recovers the optimal discriminators of many existing GAN and GAN-like algorithms (Farnia & Tse, 2018; Chu et al., 2019), most notably those in Table 1. Our analysis applies to any algorithm in this family. See Appendix B for more details on this perspective.

We also formalize the notion of a family of generators:

**Definition 2** (Family of generators). A family of generators is a set of pushforward probability measures $\{\mu_\theta = f_{\theta\#}\omega : \theta \in \mathbb{R}^p\}$, where $\omega$ is a fixed probability distribution on $Z$ (the latent variable) and $f_\theta : Z \to X$ is a measurable function (the generator).

Now, in light of Proposition 1, we are interested in the smoothness of the mapping $\theta \mapsto J(\mu_\theta)$, which would guarantee the stationarity of gradient descent on this objective, which in turn implies stationarity of the GAN algorithm under the assumption of an optimal discriminator. The following theorem is our central result, which decomposes the smoothness of $\theta \mapsto J(\mu_\theta)$ into conditions on the optimal discriminators and the family of generators.

**Theorem 1** (Smoothness decomposition for GANs). Let $J : \mathcal{M}(X) \to \bar{\mathbb{R}}$ be a convex function whose optimal discriminators $\Phi_\mu : X \to \mathbb{R}$ satisfy the following regularity conditions:

(D1) $x \mapsto \Phi_\mu(x)$ is $\alpha$-Lipschitz,
(D2) $x \mapsto \nabla_x \Phi_\mu(x)$ is $\beta_1$-Lipschitz,
(D3) $\mu \mapsto \nabla_x \Phi_\mu(x)$ is $\beta_2$-Lipschitz w.r.t. the 1-Wasserstein distance.

Also, let $\mu_\theta = f_{\theta\#}\omega$ be a family of generators that satisfies:

(G1) $\theta \mapsto f_\theta(z)$ is $A$-Lipschitz in expectation for $z \sim \omega$, i.e., $\mathbb{E}_{z\sim\omega}[\|f_{\theta_1}(z) - f_{\theta_2}(z)\|_2] \leq A\|\theta_1 - \theta_2\|_2$, and
(G2) $\theta \mapsto D_\theta f_\theta(z)$ is $B$-Lipschitz in expectation for $z \sim \omega$, i.e., $\mathbb{E}_{z\sim\omega}[\|D_{\theta_1} f_{\theta_1}(z) - D_{\theta_2} f_{\theta_2}(z)\|_2] \leq B\|\theta_1 - \theta_2\|_2$.

Then $\theta \mapsto J(\mu_\theta)$ is $L$-smooth, with $L = \alpha B + A^2(\beta_1 + \beta_2)$.

Theorem 1 connects the smoothness properties of the loss function $J$ with the smoothness properties of the optimal discriminator $\Phi_\mu$, and once paired with Proposition 1, it suggests a quantitative value $\frac{1}{L}$ for a stable generator learning rate. In order to obtain claims of stability for practically sized learning rates, it is important to tightly bound the relevant constants. In Sections 4 to 6, we carefully analyze which GAN losses satisfy (D1), (D2), and (D3), and with what constants. We summarize our results in Table 2: it turns out that none of the listed losses, except for one, satisfy (D1), (D2), and (D3) simultaneously with a finite constant. The MMD-based loss satisfies the three conditions, but its constant for (D1) grows as $\alpha = O(\sqrt{d})$, an unfavorable dependence on the data dimension $d$ that forces an unacceptably small learning rate. See the appendix for complete details of each condition.
This failure of existing GANs to satisfy the stationarity conditions corroborates the observed instability of GANs.

Table 2: Regularity of common GAN losses.

| | (D1) | (D2) | (D3) |
|---|---|---|---|
| Minimax GAN | ✗ | ✗ | ✗ |
| Non-saturating GAN | ✗ | ✗ | ✗ |
| Wasserstein GAN | ✓ | ✗ | ? |
| $\mathrm{IPM}_{\mathcal{S}}$ | ? | ✓ | ? |
| $\mathrm{MMD}^2$ | ✓ | ✓ | ✓ |

Theorem 1 decomposes smoothness into conditions on the generator and conditions on the discriminator, allowing a clean separation of concerns. In this paper, we focus on the discriminator conditions (D1), (D2), and (D3) and only provide an extremely simple example of a generator that satisfies (G1) and (G2), in Section 7. Because analysis of the generator conditions may become quite complicated and will vary with the choice of architecture considered (feedforward, convolutional, ResNet, etc.), we leave a detailed analysis of the generator conditions (G1) and (G2) as a promising avenue for future work. Indeed, such analyses may lead to new generator architectures or generator regularization techniques that stabilize GAN training.

## 3 ENFORCING SMOOTHNESS WITH INF-CONVOLUTIONS

In this section, we present a generic regularization technique that imposes the three conditions sufficient for stable learning on an arbitrary loss function $J$, thereby stabilizing training. In Section 2, we observe that the Wasserstein, IPM, and MMD losses respectively satisfy (D1), (D2), and (D3) individually, but not all of them at the same time. Using techniques from convex analysis, we convert these three GAN losses into three regularizers that, when applied simultaneously, cause the resulting loss to satisfy all three conditions. Here, we only outline the technique; the specifics of each case are deferred to Sections 4 to 6.

We start with an arbitrary base loss function $J$ to be regularized. Next, we take an existing GAN loss that satisfies the desired regularity condition and convert it into a regularizer function $R : \mathcal{M}(X) \to \bar{\mathbb{R}}$. Then, we consider $J \square R$, which denotes the inf-convolution defined as
$$(J \square R)(\xi) = \inf_{\tilde{\xi} \in \mathcal{M}(X)} J(\tilde{\xi}) + R(\xi - \tilde{\xi}). \tag{5}$$
This new function $J \square R$ inherits the regularity of $R$, making it a stable candidate as a GAN loss. Moreover, because the inf-convolution is a commutative operation, we can sequentially apply multiple regularizers $R_1$, $R_2$, and $R_3$ without destroying the added regularity. In particular, if we carefully choose functions $R_1$, $R_2$, and $R_3$, then $\tilde{J} = J \square R_1 \square R_2 \square R_3$ will satisfy (D1), (D2), and (D3) simultaneously. Moreover, under some technical assumptions, this composite function $\tilde{J}$ inherits the original minimizers of $J$, making it a sensible GAN loss:

**Proposition 2** (Invariance of minimizers). Let $R_1(\xi) := \alpha\|\xi\|_{KR}$, $R_2(\xi) := \beta_1\|\xi\|_{\mathcal{S}}$, and $R_3(\xi) := \frac{\beta_2}{4\pi}\|\hat{\xi}\|_{\mathcal{H}}^2$ be the three regularizers defined by (8), (12), and (19), respectively. Assume that $J : \mathcal{M}(X) \to \bar{\mathbb{R}}$ has a unique minimizer at $\mu_0$ with $J(\mu_0) = 0$, and $J(\mu) \geq c\|\hat{\mu} - \hat{\mu}_0\|_{\mathcal{H}}$ for some $c > 0$. Then the inf-convolution $\tilde{J} = J \square R_1 \square R_2 \square R_3$ has a unique minimizer at $\mu_0$ with $\tilde{J}(\mu_0) = 0$.

The duality formulation (4) provides a practical method for minimizing this composite function. We leverage the duality relation $(J \square R_1 \square R_2 \square R_3)^* = J^* + R_1^* + R_2^* + R_3^*$ and apply (4):
$$\inf_\mu (J \square R_1 \square R_2 \square R_3)(\mu) = \inf_\mu \sup_\phi \int \phi\, d\mu - J^*(\phi) - R_1^*(\phi) - R_2^*(\phi) - R_3^*(\phi) \tag{6}$$
$$= \inf_\mu \sup_\phi\; \mathcal{J}(\mu, \phi) - R_1^*(\phi) - R_2^*(\phi) - R_3^*(\phi). \tag{7}$$
This minimax problem can be seen as a GAN whose discriminator objective has three added regularization terms. The concrete forms of these regularizers are summarized in Table 3.
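Before turning to the concrete regularizers, a small numerical sketch (ours) of the inf-convolution in one dimension may help build intuition; it mirrors the finite-dimensional example of Appendix A, where the inf-convolution of $J_1(x) = |x|$ (1-Lipschitz, not smooth) and $J_2(x) = x^2/2$ (1-smooth, not Lipschitz) is the Huber function, which is both.

```python
import numpy as np

# Numerical inf-convolution on a grid:
#   (J1 inf-conv J2)(x) = min_z  J1(z) + J2(x - z).
# With J1(x) = |x| and J2(x) = x**2 / 2, the result is the Huber function, which
# inherits 1-Lipschitzness from J1 and 1-smoothness from J2 (cf. Appendix A).

grid = np.linspace(-5.0, 5.0, 2001)

def inf_conv(J1, J2, xs, zs):
    # For each x, minimize J1(z) + J2(x - z) over a grid of candidate z values.
    return np.array([np.min(J1(zs) + J2(x - zs)) for x in xs])

smoothed = inf_conv(np.abs, lambda x: 0.5 * x ** 2, grid, grid)
huber = np.where(np.abs(grid) <= 1.0, 0.5 * grid ** 2, np.abs(grid) - 0.5)

print(np.max(np.abs(smoothed - huber)))   # small grid-discretization error
```

The same mechanism is applied below, except that the inf-convolution is taken over signed measures rather than over points of the real line.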
Notably, we observe that we recover standard techniques for stabilizing GANs:

- (D1) is enforced by Lipschitz constraints (i.e., spectral normalization) on the discriminator.
- (D2) is enforced by spectral normalization and a choice of Lipschitz, smooth activation functions for the discriminator.
- (D3) is enforced by gradient penalties on the discriminator.

Table 3: Smoothness-inducing regularizers and their convex conjugates. To enforce a regularity condition on a loss function $J$, we take a source loss that satisfies it and view it as a regularizer $R$. We then consider the inf-convolution $J \square R$, which corresponds to an added regularization term $R^*$ on the discriminator $\phi$. These regularization terms correspond to existing GAN techniques.

| Purpose | Source loss | $R(\xi)$ | $R^*(\phi)$ | GAN techniques |
|---|---|---|---|---|
| (D1) | $W_1$ | $\Vert\xi\Vert_{KR}$ | $\chi\{\Vert\phi\Vert_{\mathrm{Lip}} \leq 1\}$ | spectral norm |
| (D2) | $\mathrm{IPM}_{\mathcal{S}}$ | $\Vert\xi\Vert_{\mathcal{S}}$ | $\chi\{\phi \in \mathcal{S}\}$ | smooth activations, spectral norm |
| (D3) | $\mathrm{MMD}^2$ | $\frac{1}{2}\Vert\hat{\xi}\Vert_{\mathcal{H}}^2$ | $\frac{1}{2}\Vert\phi\Vert_{\mathcal{H}}^2$ | gradient penalties |

Our analysis therefore puts these regularization techniques on a firm theoretical foundation (Proposition 1 and Theorem 1) and provides insight into their function.

## 4 ENFORCING (D1) WITH LIPSCHITZ CONSTRAINTS

In this section, we show that enforcing (D1) leads to techniques and notions commonly used to stabilize GANs, including the Wasserstein distance, Lipschitz constraints, and spectral normalization. Recall that (D1) demands that the optimal discriminator $\Phi_\mu$ is Lipschitz:

(D1) $x \mapsto \Phi_\mu(x)$ is $\alpha$-Lipschitz for all $\mu \in \mathcal{P}(X)$, i.e., $|\Phi_\mu(x) - \Phi_\mu(y)| \leq \alpha\|x - y\|_2$.

If $\Phi_\mu$ is differentiable, this is equivalent to the optimal discriminator having a gradient with bounded norm. This is a sensible criterion, since a discriminator whose gradient norm is too large may push the generator too hard and destabilize its training. To check (D1), the following proposition shows that it suffices to check whether $|J(\mu) - J(\nu)| \leq \alpha W_1(\mu, \nu)$ for all distributions $\mu, \nu$:

**Proposition 3.** (D1) holds if and only if $J$ is $\alpha$-Lipschitz w.r.t. the Wasserstein-1 distance.

Arjovsky et al. (2017) show that this property does not hold for common divergences based on the Kullback–Leibler or Jensen–Shannon divergence, while it does hold for the Wasserstein-1 distance. Indeed, it is this desirable property that motivates their introduction of the Wasserstein GAN. Framed in our context, their result is summarized as follows:

**Proposition 4.** The minimax and non-saturating GAN losses do not satisfy (D1) for some $\mu_0$.

**Proposition 5.** The Wasserstein GAN loss satisfies (D1) with $\alpha = 1$ for any $\mu_0$.

Our stability analysis therefore deepens the analysis of Arjovsky et al. (2017) and provides an alternative reason that the Wasserstein distance is desirable as a metric: it is part of a sufficient condition that ensures stationarity of gradient descent.

### 4.1 FROM WASSERSTEIN DISTANCE TO LIPSCHITZ CONSTRAINTS

Having identified the Wasserstein GAN loss as one that satisfies (D1), we next follow the strategy outlined in Section 3 to convert it into a regularizer for an arbitrary loss function. Towards this, we define the regularizer $R_1 : \mathcal{M}(X) \to \bar{\mathbb{R}}$ and compute its convex conjugate $R_1^* : C(X) \to \bar{\mathbb{R}}$:
$$R_1(\xi) := \alpha\|\xi\|_{KR} = \alpha \sup_{\substack{f \in C(X) \\ \|f\|_{\mathrm{Lip}} \leq 1}} \int f\, d\xi, \qquad R_1^*(\phi) = \begin{cases} 0 & \|\phi\|_{\mathrm{Lip}} \leq \alpha \\ \infty & \text{otherwise.} \end{cases} \tag{8}$$
This norm is the Kantorovich–Rubinstein norm (KR norm), which extends the Wasserstein-1 distance to $\mathcal{M}(X)$; it holds that $\|\mu - \nu\|_{KR} = W_1(\mu, \nu)$ for $\mu, \nu \in \mathcal{P}(X)$. Then, its inf-convolution with an arbitrary function inherits the Lipschitz property held by $R_1$:

**Proposition 6** (Pasch–Hausdorff). Let $J : \mathcal{M}(X) \to \bar{\mathbb{R}}$ be a function, and define $\tilde{J} := J \square R_1$.
Then $\tilde{J}$ is $\alpha$-Lipschitz w.r.t. the distance induced by the KR norm, and hence w.r.t. the Wasserstein-1 distance when restricted to $\mathcal{P}(X)$.

Due to Proposition 3, we now obtain a transformed loss function $\tilde{J}$ that automatically satisfies (D1). This function is a generalization of the Pasch–Hausdorff envelope (see Chapter 9 in Rockafeller & Wets (1998)), also known as Lipschitz regularization or the McShane–Whitney extension (McShane, 1934; Whitney, 1934; Kirszbraun, 1934; Hiriart-Urruty, 1980). The convex conjugate computation in (8) shows that $\tilde{J}$ can be minimized in practice by imposing Lipschitz constraints on discriminators. Indeed, by (4),
$$\inf_\mu\, (J \square \alpha\|\cdot\|_{KR})(\mu) = \inf_\mu \sup_\phi\; \mathbb{E}_{x\sim\mu}[\phi(x)] - J^*(\phi) - \chi\{\|\phi\|_{\mathrm{Lip}} \leq \alpha\} \tag{9}$$
$$= \inf_\mu \sup_{\phi:\, \|\phi\|_{\mathrm{Lip}} \leq \alpha} \mathcal{J}(\mu, \phi). \tag{10}$$
Farnia & Tse (2018) consider this loss $\tilde{J}$ in the special case of an f-GAN with $J(\mu) = D_f(\mu, \mu_0)$; they showed that minimizing $\tilde{J}$ corresponds to training an f-GAN normally but constraining the discriminator to be $\alpha$-Lipschitz. We show that this technique is in fact generic for any $J$: minimizing the transformed loss can be achieved by training the GAN as normal, but imposing a Lipschitz constraint on the discriminator. Our analysis therefore justifies the use of Lipschitz constraints, such as spectral normalization (Miyato et al., 2018) and weight clipping (Arjovsky et al., 2017), for general GAN losses. However, Theorem 1 also suggests that applying only Lipschitz constraints may not be enough to stabilize GANs, as (D1) alone does not ensure that the GAN objective is smooth.

## 5 ENFORCING (D2) WITH DISCRIMINATOR SMOOTHNESS

(D2) demands that the optimal discriminator $\Phi_\mu$ is smooth:

(D2) $x \mapsto \nabla_x\Phi_\mu(x)$ is $\beta_1$-Lipschitz for all $\mu \in \mathcal{P}(X)$, i.e., $\|\nabla\Phi_\mu(x) - \nabla\Phi_\mu(y)\|_2 \leq \beta_1\|x - y\|_2$.

Intuitively, this says that for a fixed generator $\mu$, the optimal discriminator $\Phi_\mu$ should not provide gradients that change too much spatially. Although the Wasserstein GAN loss satisfies (D1), we see that it, along with the minimax GAN and the non-saturating GAN, does not satisfy (D2):

**Proposition 7.** The Wasserstein, minimax, and non-saturating GAN losses do not satisfy (D2) for some $\mu_0$.

We now construct a loss that by definition satisfies (D2). Let $\mathcal{S}$ be the class of 1-smooth functions, that is, those for which $\|\nabla f(x) - \nabla f(y)\|_2 \leq \|x - y\|_2$, and consider the integral probability metric (IPM) (Müller, 1997) w.r.t. $\mathcal{S}$, defined by
$$\mathrm{IPM}_{\mathcal{S}}(\mu, \nu) := \sup_{f \in \mathcal{S}} \int f\, d\mu - \int f\, d\nu. \tag{11}$$
The optimal discriminator for the loss $\mathrm{IPM}_{\mathcal{S}}(\mu, \mu_0)$ is the function that attains the supremum in the definition. This function by definition belongs to $\mathcal{S}$ and therefore is 1-smooth. Hence, this IPM loss satisfies (D2) with $\beta_1 = 1$ by construction.

### 5.1 FROM INTEGRAL PROBABILITY METRIC TO SMOOTH DISCRIMINATORS

Having identified the IPM-based loss as one that satisfies (D2), we next follow the strategy outlined in Section 3 to convert it into a regularizer for an arbitrary loss function. To do so, we define a regularizer $R_2 : \mathcal{M}(X) \to \bar{\mathbb{R}}$ and compute its convex conjugate $R_2^* : C(X) \to \bar{\mathbb{R}}$:
$$R_2(\xi) := \beta_1\|\xi\|_{\mathcal{S}} = \beta_1 \sup_{f \in \mathcal{S}} \int f\, d\xi, \qquad R_2^*(\phi) = \begin{cases} 0 & \phi \in \beta_1\mathcal{S} \\ \infty & \text{otherwise.} \end{cases} \tag{12}$$
The norm $\|\cdot\|_{\mathcal{S}}$ is the dual norm to $\mathcal{S}$, which extends the IPM to signed measures; it holds that $\mathrm{IPM}_{\mathcal{S}}(\mu, \nu) = \|\mu - \nu\|_{\mathcal{S}}$ for $\mu, \nu \in \mathcal{P}(X)$. Similar to the situation in the previous section, inf-convolution preserves the smoothness property of $R_2$:

**Proposition 8.** Let $J : \mathcal{M}(X) \to \bar{\mathbb{R}}$ be a convex, proper, lower semicontinuous function, and define $\tilde{J} := J \square R_2$. Then the optimal discriminator for $\tilde{J}$ is $\beta_1$-smooth.
Applying (4) and (12), we see that we can minimize this transformed loss function by restricting the family of discriminators to only $\beta_1$-smooth discriminators:
$$\inf_\mu\, (J \square \beta_1\|\cdot\|_{\mathcal{S}})(\mu) = \inf_\mu \sup_\phi\; \mathbb{E}_{x\sim\mu}[\phi(x)] - J^*(\phi) - \chi\{\phi \in \beta_1\mathcal{S}\} \tag{13}$$
$$= \inf_\mu \sup_{\phi \in \beta_1\mathcal{S}} \mathcal{J}(\mu, \phi). \tag{14}$$
In practice, we can enforce this by applying spectral normalization (Miyato et al., 2018) and using a Lipschitz, smooth activation function such as ELU (Clevert et al., 2016) or sigmoid.

**Proposition 9.** Let $f : \mathbb{R}^d \to \mathbb{R}$ be a neural network consisting of $k$ layers whose linear transformations have spectral norm 1 and whose activation functions are 1-Lipschitz and 1-smooth. Then $f$ is $k$-smooth.

## 6 ENFORCING (D3) WITH GRADIENT PENALTIES

(D3) is the following smoothness condition:

(D3) $\mu \mapsto \nabla_x\Phi_\mu(x)$ is $\beta_2$-Lipschitz for any $x \in X$, i.e., $\|\nabla_x\Phi_\mu(x) - \nabla_x\Phi_\nu(x)\|_2 \leq \beta_2 W_1(\mu, \nu)$.

(D3) requires that the gradients of the optimal discriminator do not change too rapidly in response to changes in $\mu$. Indeed, if the discriminator's gradients are too sensitive to changes in the generator, the generator may not be able to accurately follow those gradients as it updates itself using a finite step size. In finite-dimensional optimization of a function $f : \mathbb{R}^d \to \mathbb{R}$, this condition is analogous to $f$ having a Lipschitz gradient.

We now present an equivalent characterization of (D3) that is easier to check in practice. We define the Bregman divergence of a convex function $J : \mathcal{M}(X) \to \bar{\mathbb{R}}$ by
$$D_J(\nu, \mu) := J(\nu) - J(\mu) - \int \Phi_\mu(x)\, d(\nu - \mu), \tag{15}$$
where $\Phi_\mu$ is the optimal discriminator for $J$ at $\mu$. Then, (D3) is characterized in terms of the Bregman divergence and the KR norm as follows:

**Proposition 10.** Let $J : \mathcal{M}(X) \to \bar{\mathbb{R}}$ be a convex function. Then $J$ satisfies (D3) if and only if $D_J(\nu, \mu) \leq \frac{\beta_2}{2}\|\mu - \nu\|_{KR}^2$ for all $\mu, \nu \in \mathcal{M}(X)$.

It is straightforward to compute the Bregman divergences corresponding to several popular GANs:
$$D_{D_{JS}(\cdot\,\Vert\,\mu_0)}(\nu, \mu) = \tfrac{1}{2}D_{KL}(\nu \,\Vert\, \mu) - D_{KL}\!\left(\tfrac{1}{2}\nu + \tfrac{1}{2}\mu_0 \,\Vert\, \tfrac{1}{2}\mu + \tfrac{1}{2}\mu_0\right), \tag{16}$$
$$D_{D_{KL}(\frac{1}{2}\cdot\,+\,\frac{1}{2}\mu_0\,\Vert\,\mu_0)}(\nu, \mu) = D_{KL}\!\left(\tfrac{1}{2}\nu + \tfrac{1}{2}\mu_0 \,\Vert\, \tfrac{1}{2}\mu + \tfrac{1}{2}\mu_0\right), \tag{17}$$
$$D_{\frac{1}{2}\mathrm{MMD}^2(\cdot,\,\mu_0)}(\nu, \mu) = \tfrac{1}{2}\mathrm{MMD}^2(\nu, \mu). \tag{18}$$
The first two Bregman divergences are not bounded above by any multiple of $\|\mu - \nu\|_{KR}^2$, for reasons similar to those discussed in Section 4, and hence:

**Proposition 11.** The minimax and non-saturating GAN losses do not satisfy (D3) for some $\mu_0$.

Even so, the Bregman divergence for the non-saturating loss is always less than that of the minimax GAN, suggesting that the non-saturating loss should be stable in more situations than the minimax GAN. On the other hand, the MMD-based loss (Li et al., 2015) does satisfy (D3) when its kernel is the Gaussian kernel $K(x, y) = e^{-\pi\|x-y\|^2}$:

**Proposition 12.** The MMD loss with Gaussian kernel satisfies (D3) with $\beta_2 = 2\pi$ for all $\mu_0$.

### 6.1 FROM MAXIMUM MEAN DISCREPANCY TO GRADIENT PENALTIES

Having identified the MMD-based loss as one that satisfies (D3), we next follow the strategy outlined in Section 3 to convert it into a regularizer for an arbitrary loss function. To do so, we define the regularizer $R_3 : \mathcal{M}(X) \to \bar{\mathbb{R}}$ and compute its convex conjugate $R_3^* : C(X) \to \bar{\mathbb{R}}$:
$$R_3(\xi) := \frac{\beta_2}{4\pi}\|\hat{\xi}\|_{\mathcal{H}}^2, \qquad R_3^*(\phi) = \frac{\pi}{\beta_2}\|\phi\|_{\mathcal{H}}^2. \tag{19}$$
The norm $\|\cdot\|_{\mathcal{H}}$ is the norm of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ with Gaussian kernel; this norm extends the MMD to signed measures, as it holds that $\mathrm{MMD}(\mu, \nu) = \|\hat{\mu} - \hat{\nu}\|_{\mathcal{H}}$ for $\mu, \nu \in \mathcal{P}(X)$. Here, $\hat{\xi} = \int K(x, \cdot)\, \xi(dx) \in \mathcal{H}$ denotes the mean embedding of a signed measure $\xi \in \mathcal{M}(X)$; we also adopt the convention that $\|\phi\|_{\mathcal{H}} = \infty$ if $\phi \notin \mathcal{H}$. Similar to the situation in the previous sections, inf-convolution preserves the smoothness property of $R_3$:

**Proposition 13** (Moreau–Yosida regularization). Suppose $J : \mathcal{M}(X) \to \bar{\mathbb{R}}$ is convex, and define $\tilde{J} := J \square R_3$.
Then $\tilde{J}$ is convex, and $D_{\tilde{J}}(\nu, \mu) \leq \frac{\beta_2}{2}\|\mu - \nu\|_{KR}^2$.

By Proposition 10, this transformed loss function satisfies (D3), having inherited the regularity properties of the squared MMD. This transformed function is a generalization of Moreau–Yosida regularization or the Moreau envelope (see Chapter 1 in Rockafeller & Wets (1998)). It is well known that in the case of a function $f : \mathbb{R}^n \to \mathbb{R}$, this regularization results in a function with Lipschitz gradients, so it is unsurprising that this property carries over to the infinite-dimensional case.

Applying (4) and (19), we see that the transformed loss function can be minimized as a GAN by implementing a squared RKHS norm penalty on the discriminator:
$$\inf_\mu\, \Big(J \square \tfrac{\beta_2}{4\pi}\|\hat{(\cdot)}\|_{\mathcal{H}}^2\Big)(\mu) = \inf_\mu \sup_\phi\; \mathbb{E}_{x\sim\mu}[\phi(x)] - J^*(\phi) - \tfrac{\pi}{\beta_2}\|\phi\|_{\mathcal{H}}^2. \tag{20}$$

Computationally, the RKHS norm is difficult to evaluate. We propose taking advantage of the following infinite series representation of $\|f\|_{\mathcal{H}}^2$ in terms of the derivatives of $f$ (Fasshauer & Ye, 2011; Novak et al., 2018):

**Proposition 14.** Let $\mathcal{H}$ be an RKHS with the Gaussian kernel $K(x, y) = e^{-\pi\|x-y\|^2}$. Then for $f \in \mathcal{H}$,
$$\|f\|_{\mathcal{H}}^2 = \sum_{k=0}^{\infty} (4\pi)^{-k} \sum_{k_1+\cdots+k_d=k} \frac{1}{\prod_{i=1}^d k_i!} \big\|\partial_{x_1}^{k_1}\cdots\partial_{x_d}^{k_d} f\big\|_{L^2(\mathbb{R}^d)}^2 \tag{21}$$
$$= \|f\|_{L^2(\mathbb{R}^d)}^2 + \frac{1}{4\pi}\|\nabla f\|_{L^2(\mathbb{R}^d)}^2 + \frac{1}{16\pi^2}\|\nabla^2 f\|_{L^2(\mathbb{R}^d)}^2 + \text{other terms}. \tag{22}$$

In an ideal world, we would use this expression as a penalty on the discriminator to enforce (D3). Of course, as an infinite series, this formulation is computationally impractical. However, the first two terms are very close to common GAN techniques like gradient penalties (Gulrajani et al., 2017) and penalizing the output of the discriminator (Karras et al., 2018). We therefore interpret these common practices as partially applying the penalty given by the squared RKHS norm, approximately enforcing (D3). We view the choice of using only the leading terms as a disadvantageous but practical necessity.

Interestingly, according to our analysis, gradient penalties and spectral normalization are not interchangeable, even though both techniques were designed to constrain the Lipschitz constant of the discriminator. Instead, our analysis suggests that they serve different purposes: gradient penalties enforce the variational smoothness (D3), while spectral normalization enforces Lipschitz continuity (D1). This demystifies the puzzling observation of Miyato (2018) that GANs using only spectral normalization with a WGAN loss do not seem to train well; it also explains why using both spectral normalization and a gradient penalty is a reasonable strategy. It also motivates the use of gradient penalties applied to losses other than the Wasserstein loss (Fedus et al., 2018).

## 7 VERIFYING THE THEORETICAL LEARNING RATE

In this section, we empirically test the theoretical learning rate given by Theorem 1 and Proposition 1 as well as our regularization scheme (7) based on inf-convolutions. We approximately implement our composite regularization scheme (7) on a trivial base loss of $J(\mu) = \chi\{\mu = \mu_0\}$ by alternating stochastic gradient steps on
$$\inf_\mu \sup_\phi\; \mathbb{E}_{x\sim\mu}[\phi(x)] - \mathbb{E}_{x\sim\mu_0}[\phi(x)] - \frac{\pi}{\beta_2}\,\mathbb{E}_{x\sim\tilde{\mu}}\Big[\phi(x)^2 + \frac{1}{4\pi}\|\nabla\phi(x)\|^2\Big], \tag{23}$$
where $\tilde{\mu}$ is a random interpolate between samples from $\mu$ and $\mu_0$, as used in Gulrajani et al. (2017). The regularization term is a truncation of the series for the squared RKHS norm (22) and approximately enforces (D3). The discriminator is a 7-layer convolutional neural network with spectral normalization¹ and ELU activations, an architecture that enforces (D1) and (D2).
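For illustration, the PyTorch sketch below (ours) evaluates one discriminator loss for the regularized objective (23): a spectral-normalized ELU network plus the truncated RKHS-norm penalty on random interpolates. It only approximates the setup described here: `torch.nn.utils.spectral_norm` normalizes the convolution kernel rather than the full convolution operator of footnote 1, the network is smaller than the 7-layer discriminator above and omits the final scaling by $\alpha$, and all sizes and constants are arbitrary choices.

```python
import math
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(cin, cout, stride):
    # Spectral-normalized 3x3 convolution (kernel-matrix normalization, an approximation).
    return spectral_norm(nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1))

# Spectral normalization + ELU: Lipschitz, smooth layers (cf. Proposition 9).
discriminator = nn.Sequential(
    sn_conv(3, 64, 2), nn.ELU(),
    sn_conv(64, 128, 2), nn.ELU(),
    sn_conv(128, 128, 2), nn.ELU(),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 4 * 4, 1)),   # assumes 32x32 inputs (e.g., CIFAR-10)
)

def discriminator_loss(real, fake, beta2=1.0):
    # Wasserstein-style term of (23): E_{x~mu}[phi(x)] - E_{x~mu0}[phi(x)],
    # with fake ~ mu (generator) and real ~ mu0 (data); the discriminator maximizes it.
    wasserstein = discriminator(fake).mean() - discriminator(real).mean()

    # Random interpolates between generator and data samples (Gulrajani et al., 2017).
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_tilde = (eps * real + (1 - eps) * fake).detach().requires_grad_(True)
    phi = discriminator(x_tilde).squeeze(1)
    grad = torch.autograd.grad(phi.sum(), x_tilde, create_graph=True)[0]
    grad_sq = grad.flatten(1).pow(2).sum(dim=1)

    # Truncated RKHS-norm penalty of (23): (pi/beta2) * E[ phi^2 + ||grad phi||^2 / (4 pi) ].
    penalty = (math.pi / beta2) * (phi.pow(2) + grad_sq / (4 * math.pi)).mean()

    # Minimize the negative of the inner sup over phi.
    return -(wasserstein - penalty)
```

A generator step would then simply minimize the mean discriminator output on generated samples, in line with objective (23).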
We include a final scalar multiplication by $\alpha$ so that by Proposition 9, $\beta_1 = 7\alpha$. We take two discriminator steps for every generator step, to better approximate our assumption of an optimal discriminator.

¹ Whereas Miyato et al. (2018) divide each layer by the spectral norm of the convolutional kernel, we divide by the spectral norm of the convolutional operator itself, computed using the same power iteration algorithm applied to the operator. This is so that the layers are truly 1-Lipschitz, which is critical for our theory.

Figure 1: FID of simple particle generators with various learning rates. The horizontal axis shows the ratio $\gamma/\gamma_0$ between the actual learning rate $\gamma$ and the theoretical learning rate $\gamma_0$ suggested by Theorem 1 and Proposition 1. The vertical axis shows the FID after 100,000 SGD iterations.

For the generator, we use an extremely simple particle-based generator which satisfies (G1) and (G2), in order to minimize the number of confounding factors in our experiment. Let $\omega$ be the discrete uniform distribution on $Z = \{1, \ldots, N\}$. For an $N \times d$ matrix $\theta$ and $z \in Z$, define $f_\theta : Z \to \mathbb{R}^d$ so that $f_\theta(z)$ is the $z$th row of $\theta$. The particle generator $\mu_\theta = f_{\theta\#}\omega$ satisfies (G1) with $A = \frac{1}{\sqrt{N}}$, since
$$\mathbb{E}_z[\|f_\theta(z) - f_{\theta'}(z)\|_2] = \frac{1}{N}\sum_{z=1}^{N} \|\theta_z - \theta'_z\|_2 \leq \frac{1}{\sqrt{N}}\|\theta - \theta'\|_F, \tag{24}$$
and it satisfies (G2) with $B = 0$, since $D_\theta f_\theta(z)$ is constant w.r.t. $\theta$. With this setup, Theorem 1 suggests a theoretical learning rate of
$$\gamma_0 = \frac{1}{L} = \frac{1}{\alpha B + A^2(\beta_1 + \beta_2)} = \frac{N}{7\alpha + \beta_2}. \tag{25}$$

We randomly generated hyperparameter settings for the Lipschitz constant $\alpha$, the smoothness constant $\beta_2$, the number of particles $N$, and the learning rate $\gamma$. We trained each model for 100,000 steps on CIFAR-10 and evaluated each model using the Fréchet Inception Distance (FID) of Heusel et al. (2017). We hypothesize that stability is correlated with image quality; Figure 1 plots the FID for each hyperparameter setting in terms of the ratio of the true learning rate $\gamma$ to the theoretically motivated learning rate $\gamma_0$. We find that the best FID scores are obtained in the region where $\gamma/\gamma_0$ is between 1 and 1000. For small learning rates $\gamma/\gamma_0 \ll 1$, we observe that convergence is too slow to make reasonable progress on the objective, whereas as the learning rate gets larger, $\gamma/\gamma_0 \gg 1$, we observe a steady increase in FID, signalling unstable behavior. It also makes sense that learning rates slightly above the optimal rate produce good results, since our theoretical learning rate is a conservative lower bound. Note that our intention is to test our theory, not to generate good images, which is difficult due to our weak choice of generator. Overall, this experiment shows that our theory and regularization scheme are sensible.

## 8 FUTURE WORK

**Inexact gradient descent.** In this paper, we employed several assumptions in order to regard the GAN algorithm as gradient descent. However, real-world GAN algorithms must be treated as inexact descent algorithms. As such, future work includes: (i) relaxing the optimal discriminator assumption (cf. Sanjabi et al. (2018)) or providing a stability result for discrete simultaneous gradient descent (cf. the continuous-time analyses in Nagarajan & Kolter (2017); Mescheder et al. (2018)), (ii) addressing stochastic approximations of gradients (i.e., SGD), and (iii) providing error bounds for the truncated gradient penalty used in (23).

**Generator architectures.** Another important direction of research is to seek more powerful generator architectures that satisfy our smoothness assumptions (G1) and (G2).
In practice, generators are often implemented as deep neural networks, and involve some specific architectures such as deconvolution layers (Radford et al., 2015) and residual blocks (e.g., Gulrajani et al. (2017); Miyato et al. (2018)). In this paper, we did not provide results on the smoothness of general classes of generators, since our focus is to analyze stability properties influenced by the choice of loss function J (and therefore optimal discriminators). However, our conditions (G1) and (G2) shed light on how to obtain smoothly parameterized neural networks, which is left for future work. Published as a conference paper at ICLR 2020 ACKNOWLEDGMENTS We would like to thank Kohei Hayashi, Katsuhiko Ishiguro, Masanori Koyama, Shin-ichi Maeda, Takeru Miyato, Masaki Watanabe, and Shoichiro Yamaguchi for helpful discussions. Luigi Ambrosio, Nicola Gigli, and Giuseppe Savar e. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Birkh auser, second edition, 2008. Michael Arbel, Dougal Sutherland, Mikołaj Bi nkowski, and Arthur Gretton. On gradient regularizers for MMD GANs. In Advances in Neural Information Processing Systems, pp. 6700 6710, 2018. Martin Arjovsky and Leon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017. Martin Arjovsky, Soumith Chintala, and L eon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214 223, 2017. Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011. Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999. Leon Bottou, Martin Arjovsky, David Lopez-Paz, and Maxime Oquab. Geometrical insights for implicit generative modeling. In Braverman Readings in Machine Learning. Key Ideas from Inception to Current State, pp. 229 268. Springer, 2018. Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. S ebastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231 357, 2015. Casey Chu, Jose Blanchet, and Peter Glynn. Probability functional descent: A unifying perspective on GANs, variational inference, and reinforcement learning. In International Conference on Machine Learning, 2019. Djork-Arn e Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). 2016. Farzan Farnia and David Tse. A convex duality framework for GANs. In Advances in Neural Information Processing Systems, pp. 5248 5258, 2018. Gregory E Fasshauer and Qi Ye. Reproducing kernels of generalized Sobolev spaces via a Green function approach with distributional operators. Numerische Mathematik, 119(3):585 611, 2011. William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. In International Conference on Learning Representations, 2018. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672 2680, 2014. 
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch olkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723 773, 2012. Paulina Grnarova, Kfir Y Levy, Aurelien Lucchi, Thomas Hofmann, and Andreas Krause. An online learning approach to generative adversarial networks. In International Conference on Learning Representations, 2018. Published as a conference paper at ICLR 2020 Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767 5777, 2017. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, 2017. J-B Hiriart-Urruty. Extension of Lipschitz functions. Journal of Mathematical Analysis and Applications, 77(2):539 554, 1980. Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. M Kirszbraun. Uber die zusammenziehende und Lipschitzsche Transformationen. Fundamenta Mathematicae, 22(1):77 108, 1934. Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of GANs. ar Xiv preprint ar Xiv:1705.07215, 2017. Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pp. 2203 2213, 2017a. Jerry Li, Aleksander Madry, John Peebles, and Ludwig Schmidt. On the limitations of first-order approximation in GAN dynamics. In International Conference on Machine Learning, 2017b. Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International Conference on Machine Learning, pp. 1718 1727, 2015. Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pp. 5545 5553, 2017. Edward James Mc Shane. Extension of range of functions. Bulletin of the American Mathematical Society, 40(12):837 842, 1934. Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning, 2018. Takeru Miyato. How to combine WGAN and spectral norm? https://github.com/ pfnet-research/sngan_projection/issues/15, 2018. Accessed: 2019-09-24. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. Alfred M uller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429 443, 1997. Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, 2017. Erich Novak, Mario Ullrich, Henryk Wo zniakowski, and Shun Zhang. Reproducing kernels of Sobolev spaces on Rd and applications to embedding constants and tractability. Analysis and Applications, 16(05):693 715, 2018. Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 
271 279, 2016. Frans A Oliehoek, Rahul Savani, Jose Gallego, Elise van der Pol, and Roderich Groß. Beyond local Nash equilibria for adversarial networks. ar Xiv preprint ar Xiv:1806.07268, 2018. Published as a conference paper at ICLR 2020 Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ar Xiv preprint ar Xiv:1511.06434, 2015. R. T. Rockafeller and R. J-B Wets. Variational Analysis, volume 317 of Grundlehren der mathematischen Wissenschaften. Springer-Verlag, Berlin, Heidelberg, 1998. Walter Rudin. The Principles of Mathematical Analysis. Mc Graw-Hill New York, 3rd edition, 1964. Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. On the convergence and robustness of training GANs with regularized optimal transport. In Advances in Neural Information Processing Systems, 2018. Aaron Sidford. MS&E 213 / CS 269O : Chapter 5 - smooth convex generalizations, 2017. URL http://www.aaronsidford.com/chap_5_extend_v7.pdf. Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Gert R Lanckriet, and Bernhard Sch olkopf. Kernel choice and classifiability for rkhs embeddings of probability distributions. In Advances in neural information processing systems, pp. 1750 1758, 2009. Thomas Str omberg. The operation of infimal convolution. Instytut Matematyczny Polskiej Akademi Nauk, 1996. URL http://eudml.org/doc/271746. Thomas Unterthiner, Bernhard Nessler, Calvin Seward, G unter Klambauer, Martin Heusel, Hubert Ramsauer, and Sepp Hochreiter. Coulomb GANs: Provably optimal Nash equilibria via potential fields. In International Conference on Learning Representations, 2018. C edrik Villani. Optimal Transport: Old and New. Springer, 2009. Hassler Whitney. Analytic extensions of differentiable functions defined in closed sets. Transactions of the American Mathematical Society, 36(1):63 89, 1934. Xingyu Zhou. On the Fenchel duality between strong convexity and Lipschitz continuous gradient. ar Xiv preprint ar Xiv:1803.06573, 2018. A INF-CONVOLUTION IN Rd To gain intuition on the inf-convolution, we present a finite-dimensional analogue of the techniques in Section 3. For simplicity of presentation, we will omit any regularity conditions (e.g., lower semicontinuity). We refer readers to Chapter 12 of Bauschke & Combettes (2011) for a detailed introduction. Let J and R be convex functions on Rd. The inf-convolution of J and R is a function J R defined as (J R)(x) := inf z Rd J(z) + R(x z). The inf-convolution is often called the epigraphic sum since the epigraph of J R coincides with the Minkowski sum of epigraphs of J and R, as Figure 2 illustrates. The inf-convolution is associative and commutative operation; that is, it is always true that (J1 J2) J3 = J1 (J2 J3) =: J1 J2 J3 and J1 J2 = J2 J1. There are two important special cases of inf-convolutions: The first one is the Pasch Hausdorff envelope Jα, which is the inf-convolution between J and α 2 (α > 0). It is known that Jα becomes α-Lipschitz. The second important example is the Moreau envelope Jβ = J 1 2β 2 2, i.e., the inf-convolution with the quadratic regularizer 1 2β 2 2. The Moreau envelope Jβ is always differentiable, and the gradient of Jβ is β-Lipschitz (thus Jβ is β-smooth). It is worth noting that the set of minimizers does not change after these two operations. More generally, we have the following result: Proposition 15. Let J, R : Rd R be proper and lower semicontinuous functions with min J > and min R = 0. 
Suppose R(0) = 0 and R(x) ψ( x 2) for some increasing function ψ : R 0 R. Then, min J R = min J and arg min J R = arg min J. Published as a conference paper at ICLR 2020 Figure 2: Illustration of inf-convolutions and their convex conjugates in R. Top row: Generally, inf-convolutions inherit the regularity of their parent functions. In this example, J1(x) = |x| is 1-Lipschitz but not smooth, while J2(x) = x2/2 is 1-smooth but not Lipschitz. The inf-convolution J1 J2 is the well-known Huber function, which is both 1-Lipschitz and 1-smooth. Bottom row: Convex conjugates of the functions in the top row. The conjugate of J1 J2 is given as the sum of conjugates J 1 (z) = χ{|z| 1} and J 2 (z) = z2/2. To sum up, given a function J, we can always construct a regularized alternative Jβ α that is αLipschitz and β-smooth and has the same minimizers as J. The next question is how to implement the inf-convolution in GAN-like optimization problems. For this, it is convenient to consider the convex conjugate. Recall that the Fenchel Moreau theorem says that there is a duality between a convex function J and its convex conjugate J as J(x) = supz Rd x, z J (z) and J (z) = supx Rd x, z J(x). The important property is that the convex conjugate of the inf-convolution is the sum of convex conjugates, that is, we always have (J R) (z) = J (z) + R (z). This property can be useful for implementing the regularized objective Jβ α as follows. First, we can check that the convex conjugates of the norm and the squared norm are given as ( 2) = χ{ 1} and ( 1 2 2 2. Hence, we have Jβ α(x) := J α 2 1 (x) = sup z: z 2 α x, z J (z) β which means that minimizing Jβ α can be recast in min-max problem with the norm clipping and ℓ2-regularization on the dual variable z. B COMMON GAN LOSSES For completeness and clarity, we explicitly write out the expressions for the losses listed in Table 1. For more detailed computations of optimal discriminators, see Chu et al. (2019); for more details on the convex duality interpretation, see Farnia & Tse (2018). Minimax GAN Goodfellow et al. (2014) originally proposed the minimax GAN and showed that the corresponding loss function for the minimax GAN is the Jensen Shannon divergence, defined as J(µ) := DJS(µ || µ0) := 1 2DKL(µ || 1 2DKL(µ0 || 1 Published as a conference paper at ICLR 2020 where µ0 P(X) is a fixed probability measure (usually the empirical measure of the data), and DKL(µ || ν) is the Kullback Leibler divergence between µ and ν. The optimal discriminator in the sense of Definition 1 is given as 2 log dµ d(µ + µ0)(x), where dµ d(µ+µ0) is the Radon Nikodym derivative. If µ and µ0 have densities µ(x) and µ0(x), then dµ d(µ + µ0)(x) = µ(x) µ(x) + µ0(x), so our optimal discriminator matches that of Goodfellow et al. (2014) up to a constant factor and logarithm. To recover the minimax formulation, the convex duality (4) yields: inf µ DJS(µ, µ0) = inf µ sup ϕ Ex µ[ϕ(x)] ( 1 2Ex µ0[log(1 e2ϕ(x)+log 2)] 1 2 log 2 | {z } (DJS( ,µ0)) (ϕ) = inf µ sup D 1 2Ex µ[log(1 D(x))] + 1 2Ex µ0[log D(x)], using the substitution ϕ = 1 2 log(1 D) 1 Non-saturating GAN Goodfellow et al. (2014) also proposed the heuristic non-saturating GAN. Theorem 2.5 of Arjovsky & Bottou (2017) shows that the loss function minimized is J(µ) := DKL( 1 2µ0 || µ0) = 1 2DKL(µ || µ0) DJS(µ || µ0). The optimal discriminator is 2 log dµ0 d(µ + µ0)(x). Wasserstein GAN Arjovsky et al. 
(2017) proposed the Wasserstein GAN, which minimizes the Wasserstein-1 distance between the input µ and a fixed measure µ0: J(µ) := W1(µ, µ0) := inf π E(x,y) π[||x y||], where the infimum is taken over all couplings π, probability distributions over X X whose marginals are µ and µ0 respectively. The optimal discriminator Φµ is called the Kantorovich potential in the optimal transport literature (Villani, 2009). The convex duality (4) recover the Wasserstein GAN: inf µ W1(µ, µ0) = inf µ sup ϕ Ex µ[ϕ(x)] (Ex µ0[ϕ(x)] + χ{||ϕ||Lip 1} | {z } (W1( ,µ0)) (ϕ) = inf µ sup ||ϕ||Lip 1 Ex µ[ϕ(x)] Ex µ0[ϕ(x)], an expression of Kantorovich Rubinstein duality. The Lipschitz constraint on the discriminator is typically enforced by spectral normalization (Miyato et al., 2018), less frequently by weight clipping (Arjovsky et al., 2017), or heuristically by gradient penalties (Gulrajani et al., 2017) (although this work shows that gradient penalties may serve a different purpose altogether). Maximum mean discrepancy Given a positive definite kernel K : X X R, the maximum mean discrepancy (MMD, Gretton et al. (2012)) between µ and ν is defined by 2MMD2 K(µ, ν) := 1 Z K(x, y) (µ ν)(dx) (µ ν)(dy). where (H, H) is the reproducing kernel Hilbert space (RKHS) for K. The generative momentmatching network (GMMN, Li et al. (2015)) and the Coulomb GAN (Unterthiner et al., 2018) use the squared MMD as the loss function. The optimal discriminator in this case is Φµ(x) = Ey µ[K(x, y)] Ey µ0[K(x, y)], Published as a conference paper at ICLR 2020 which in constrast to other GANs, may be approximated by simple Monte Carlo, rather than an auxiliary optimization problem. Note that MMD-GANs (Li et al., 2017a; Arbel et al., 2018) minimize a modified version of the MMD, the Optimized MMD (Sriperumbudur et al., 2009; Arbel et al., 2018). These MMD-GANs are adversarial in a way that does not arise from convex duality, so our theory currently does not apply to these GANs. Integral probability metrics An integral probability metric (M uller, 1997) is defined by J(µ) := IPMF(µ, µ0) := sup f F Z f dµ Z f dµ0, where F is a class of functions. The optimal discriminator is the function that maximizes the supremum in the definition. The Wasserstein distance may be thought of as an IPM with F containing all 1-Lipschitz functions. The MMD may be thought of as an IPM with F all functions with RKHS norm at most 1, but no GANs based on MMD are actually trained this way, as it is difficult to constrain the discriminator to such functions. C OPTIMAL DISCRIMINATORS ARE FUNCTIONAL DERIVATIVES Let J : P(X) R be a convex function. Recall the definition of the optimal discriminator (Definition 1): Φµ arg max ϕ C(X) Z ϕ dµ J (ϕ). This definition can be understood as an infinite dimensional analogue of subgradients. In fact, in finite-dimensional convex analysis, z is a subgradient of f : Rd R if and only if it can be written as z arg maxz z , x f (z ). The calculus of subgradients shares many properties with the standard calculus of derivatives, such as chain rules (Rockafeller & Wets, 1998). This motivate us to investigate derivative-like features of optimal discriminators. We introduce the functional derivative, also known as the von Mises influence function: Definition 3 (Functional derivative). Let J : P(X) R be a function of probability measures. We say that a continuous function Φµ : X R is a functional derivative of J at µ if J(µ + ϵξ) = J(µ) + ϵ Z Φµ dξ + O(ϵ2) holds for any ξ = ν µ with ν P(X). 
Under this definition, optimal discriminators are actually functional derivatives. Proposition 16 (Chu et al. (2019), Theorem 2). Let J : M(X) R be a proper, lower semicontinuous, and convex function. If there exists a maximizer Φµ of ϕ 7 R ϕ dµ J (ϕ), then Φµ is a functional derivative of J at µ in the sense of Definition 3. The following result relates the derivative of the loss function with the derivative of the optimal discriminator: Proposition 17 (Chu et al. (2019), Theorem 1). Let J : P(X) R be continuously differentiable, in the sense that the functional derivative Φµ exists and (µ, ν) 7 Ex ν[Φµ(x)] is continuous. Let θ 7 µθ be continuous in the sense that 1 h (µθ+h µθ) converges to a weak limit as h 0. Then, we have θJ(µθ) = θEx µθ[ˆΦ(x)], where ˆΦ = Φµθ is treated as a function X R that is not dependent on θ. We use this important computational tool in many of our proofs. For the case of the generator model µθ = fθ#ω, an important consequence of Proposition 17 is that θJ(µθ) = θEz ω[ˆΦ(fθ(z))] = Ez ω[ x=fθ(z)Φµ(fθ(z)) Dθfθ(z)]. We use this fact in the proof of Theorem 1. Published as a conference paper at ICLR 2020 D PROOFS FOR SECTIONS 1 AND 2 The following result is well known in the dynamical systems and the optimization literature. For the sake of completeness, we provide its proof. Proposition 1 (Bertsekas (1999), Proposition 1.2.3). Suppose f : Rp R is L-smooth and bounded below. Let xk+1 := xk 1 L f(xk). Then || f(xk)|| 0 as k . Proof. It is known that if f is L-smooth, then f(y) f(x) + f(x), y x + L holds for any x, y Rd (see e.g. Lemma 3.4 in Bubeck (2015)). Then, we have f(xk+1) f(xk) + f(xk), xk+1 xk + L 2 ||xk+1 xk||2 L|| f(xk)||2 + 1 2L|| f(xk)||2 2L|| f(xk)||2. Summing this inequality over k, we have k=0 || f(xk)||2 f(x0) f(xn), from which we conclude that min 0 k n 1 || f(xk)||2 2L n (f(x0) inf x f(x)). Next, we move on to prove Theorem 1, which we restate here for readability. Theorem 1 (Smoothness decomposition for GANs). Let J : M(X) R be a convex function whose optimal discriminators Φµ : X R satisfy the following regularity conditions: (D1) x 7 Φµ(x) is α-Lipschitz, (D2) x 7 xΦµ(x) is β1-Lipschitz, (D3) µ 7 xΦµ(x) is β2-Lipschitz w.r.t. the 1-Wasserstein distance. Also, let µθ = fθ#ω be a family of generators that satisfies: (G1) θ 7 fθ(z) is A-Lipschitz in expectation for z ω, i.e., Ez ω[ fθ1(z) fθ2(z) 2] A θ1 θ2 2, and (G2) θ 7 Dθfθ(z) is B-Lipschitz in expectation for z ω, i.e., Ez ω[ Dθ1fθ1(z) Dθ2fθ2(z) 2] B θ1 θ2 2. Then θ 7 J(µθ) is L-smooth, with L = αB + A2(β1 + β2). Intuitively, Theorem 1 can be understood as the chain rule. For simplicity, let us consider the smoothness of a composite function D G, where D and G are functions on R. A sufficient condition for D G to be smooth is that its second derivative is bounded. For this, suppose that (i) D is α-Lipschitz and β-smooth and (ii) G is A-Lipschitz and B-smooth. By the chain rule, the second derivative is bounded as (D G) = (D G)G + (D G)(G )2 αB + A2β, which has the same form as the consequence of Theorem 1. For GANs, we need somewhat more involved conditions; We need two types of smoothness (D2) and (D3) for the optimal discriminators. To this end, we utilize the functional gradient point of view that we explained in Appendix C. Published as a conference paper at ICLR 2020 Proof of Theorem 1. 
First, using the functional gradient interpretation of the optimal discriminator, we have || θ1J(µθ1) θ2J(µθ2)|| = Ez ω Φµθ1(fθ1(z)) Dθ1fθ1 Φµθ2(fθ2(z)) Dθ2fθ2 Proposition 17 = Ez ω Φµθ1(fθ1(z)) (Dθ1fθ1 Dθ2fθ2) + Φµθ1(fθ1(z)) Φµθ2(fθ1(z)) Dθ2fθ2 + Φµθ2(fθ1(z)) Φµθ2(fθ2(z)) Dθ2fθ2 Ez ω[ Φµθ1(fθ1(z)) (Dθ1fθ1 Dθ2fθ2)] (a) + Ez ω[ Φµθ1(fθ1(z)) Φµθ2(fθ1(z)) Dθ2fθ2] (b) + Ez ω[ Φµθ2(fθ1(z)) Φµθ2(fθ2(z)) Dθ2fθ2] . (c) By assumption, there are bounded non-negative numbers α, β1, β2, A, and B such that (D1 ) α = sup µ P(X) sup x µ Φµ(x) 2 , (D2 ) sup µ P(X) x1Φµ(x1) x2Φµ(x2) 2 β1 x1 x2 2 (D3 ) sup x X xΦµ1(x) xΦµ2(x) 2 β2W1(µ1, µ2), (G1 ) A = Ez ω sup θ ||Dθfθ(z)||op, and (G2 ) Ez ω||Dθ1fθ1(z) Dθ2fθ2(z)||op B||θ1 θ2||. Here, we wrote supx µ f(x) for the essential supremum of f w.r.t. µ. From (D1 ) and (G2 ), the first term (a) is bounded as Ez ω[ Φµθ1(fθ1(z)) (Dθ1fθ1 Dθ2fθ2)] αB θ1 θ2 . From (D3 ), (G1 ), and the Cauchy-Schwarz inequality, the second term (b) is bounded as Ez ω[ Φµθ1(fθ1(z)) Φµθ2(fθ1(z)) Dθ2fθ2] Aβ2W1(µθ1, µθ2) Aβ2Ez ω fθ1(z) fθ2(z) A2β2 θ1 θ2 , where the second inequality holds from the following optimal transport interpretation of W1: W1(µθ1, µθ2) = inf γ P(X X): coupling of µθ1 and µθ2 Z x y dγ Ez ω fθ1(z) fθ2(z) . Lastly, from (D2 ), (G1 ) and the Cauchy-Schwarz inequality, the term (c) is bounded as Ez ω[ Φµθ2(fθ1(z)) Φµθ2(fθ2(z)) Dθ2fθ2] Aβ1Ez ω fθ1(z) fθ2(z) A2β1 θ1 θ2 . Combining the above upper bounds for (a) (c), we conclude that || θ1J(µθ1) θ2J(µθ2)|| (αB + A2(β1 + β2)) θ1 θ2 . E PROOFS FOR SECTION 3 In the finite-dimensional case, Proposition 15 says that taking inf-convolution with a coercive regularizer does not change the set of minimizers of the original objective. A similar invariance holds for GAN objectives, which are defined on infinite-dimensional space of signed measures, for the regularizers R1, R2 and R3 introduced in Sections 4 to 6: Published as a conference paper at ICLR 2020 Proposition 2 (Invariance of minimizers). Let R1(ξ) := ξ KR, R2(ξ) := ξ S , and R3(ξ) := 1 2 ˆξ 2 H be the three regularizers defined by (8), (12), and (19) respectively. Assume that J : M(X) R has a unique minimizer at µ0 with J(µ0) = 0, and J(µ) c ˆµ ˆµ0 H for some c > 0. Then the inf-convolution J = J R1 R2 R3 has a unique minimizer at µ0 with J(µ0) = 0. Proof. In the following, we endow M(X) with the Gaussian RKHS norm ˆ H, and for convenience will refer to this norm if not otherwise specified. To show the result, we apply Lemma 1 three times, first on R3 and R1, and then again on R3 R1 and R2, and then finally on R3 R1 R2 and J( + µ0). The result follows from noting that J(µ) = [R3 R1 R2 J( + µ0)](µ µ0). In order to apply the lemma, it suffices to show that R3 is uniformly continuous on bounded sets and coercive, and that there exist constants c1 and c2 such that R1(ξ) c1 ξ and R2(ξ) c2 ξ . R3 is uniformly continuous on bounded sets: suppose ξ < C and ξ < C; then |R3(ξ) R3(ξ )| = | 1 2 ξ 2| = | 1 2 ξ ξ , ξ +ξ | C ξ ξ by the Cauchy Schwarz and triangle inequality. R3 is also coercive, as R3(ξ) = 1 2 ξ 2 when ξ . R1 satisfies R1(ξ) c1 ξ by Lemma 7. R2 satisfies R2(ξ) c2 ξ by Lemma 5. Recall that a function F is said to be coercive if F(ξ) when ξ . Lemma 1. Suppose F : M(X) R has a unique minimizer at 0 with F(0) = 0, and is uniformly continuous on bounded sets and coercive. Suppose G : M(X) R has a unique minimizer at 0 with G(0) = 0, and G( ) c for some c > 0. 
Lemma 1. Suppose F : M(X) → R has a unique minimizer at 0 with F(0) = 0, and is uniformly continuous on bounded sets and coercive. Suppose G : M(X) → R has a unique minimizer at 0 with G(0) = 0, and G(·) ≥ c‖·‖ for some c > 0. Then the inf-convolution F □ G has a unique minimizer at 0 with (F □ G)(0) = 0, and is uniformly continuous on bounded sets, coercive, and real-valued.

Proof. From Theorem 2.3 of Strömberg (1996), inf F □ G = 0 and 0 ∈ arg min F □ G. To show that 0 is the unique minimizer, let ξ ∈ arg min F □ G. From the definition of inf-convolution, there exists a sequence ξ_n ∈ M(X) that satisfies

lim_{n→∞} [F(ξ_n) + G(ξ − ξ_n)] = (F □ G)(ξ) = 0.

Since F and G are non-negative, F(ξ_n) → 0 and G(ξ − ξ_n) → 0. By our assumption that G(·) ≥ c‖·‖, the latter limit implies that ‖ξ − ξ_n‖ → 0, which implies that F(ξ_n) → F(ξ) by continuity of F. Comparing our two expressions for the limit of F(ξ_n), we find that F(ξ) = 0, which implies that ξ = 0, since F has a unique minimizer at 0. Hence arg min F □ G = {0}.

From Theorem 2.10 of Strömberg (1996), F □ G is uniformly continuous on bounded sets and real-valued. To show that F □ G is coercive, we show the equivalent condition that its sublevel sets are bounded (see Proposition 11.11 of Bauschke & Combettes (2011)). That is, we show that for all a > 0, there exists a constant b such that if (F □ G)(ξ) ≤ a, then ‖ξ‖ ≤ b. Let a > 0, and let ξ ∈ M(X) be such that (F □ G)(ξ) ≤ a. Let ϵ > 0; from the definition of inf-convolution, there exists a ξ̃ ∈ M(X) such that

F(ξ̃) + G(ξ − ξ̃) ≤ (F □ G)(ξ) + ϵ ≤ a + ϵ.

Because F and G are non-negative, we have that F(ξ̃) ≤ a + ϵ and G(ξ − ξ̃) ≤ a + ϵ. Using the fact that F is coercive and hence its sublevel sets are bounded, let b' be such that if F(ν) ≤ a + ϵ, then ‖ν‖ ≤ b'. By the triangle inequality and our assumption that G(·) ≥ c‖·‖,

‖ξ‖ ≤ ‖ξ̃‖ + ‖ξ − ξ̃‖ ≤ b' + G(ξ − ξ̃)/c ≤ b' + (a + ϵ)/c.

Because this constant is independent of ξ, this shows that F □ G is coercive.

F PROOFS FOR SECTION 4

Proposition 3. (D1) holds if and only if J is α-Lipschitz w.r.t. the Wasserstein-1 distance.

Proof. First, we check the forward implication:

|J(µ) − J(ν)| ≤ αW₁(µ, ν) for all µ, ν  ⟹  sup_{x∼µ} ‖∇_xΦ_µ(x)‖₂ ≤ α.

From the duality between L∞- and L¹-norms, we have

sup_{x∼µ} ‖∇_xΦ_µ(x)‖₂ = ‖∇_xΦ_µ(·)‖_{L∞(µ; R^d)} = sup_{‖v‖_{L¹(µ; R^d)}=1} ∫ ∇Φ_µ(x) · v(x) dµ(x).   (26)

Below, we abbreviate L¹(µ; R^d) as L¹. We use the fact that the optimal discriminator Φ_µ is the functional gradient of J in the sense of Definition 3. For any t ≥ 0, let µ_t = (id + tv)#µ be the law of x + tv(x) with x ∼ µ. Then µ_t converges weakly to µ = µ_0 as t → 0, since W₁(µ_t, µ) ≤ t‖v‖_{L¹(µ)} → 0. Applying Proposition 17 to J and µ_t, we have

d/dt J(µ_t)|_{t=0} = d/dt ∫ Φ_µ(x + tv(x)) dµ(x)|_{t=0} = ∫ ∇Φ_µ(x) · v(x) dµ.

In particular, the right-hand side of (26) is bounded from above as

sup_{‖v‖_{L¹(µ)}=1} ∫ ∇Φ_µ(x) · v(x) dµ(x) = sup_{‖v‖_{L¹(µ)}=1} d/dt J(µ_t)|_{t=0}
= sup_{‖v‖_{L¹(µ)}=1} lim_{t→0} (J(µ_t) − J(µ))/t
≤ sup_{‖v‖_{L¹(µ)}=1} lim_{t→0} αW₁(µ_t, µ)/t
≤ sup_{‖v‖_{L¹(µ)}=1} lim_{t→0} αt‖v‖_{L¹(µ)}/t = α,

which proves the first half of the statement.

For the converse, we borrow some techniques from optimal transport theory. Let p > 1, and let W_p(µ, ν) denote the Wasserstein-p distance between µ and ν. Suppose that µ_t : [0, 1] → P(X) is a P_p(X)-absolutely continuous curve, that is, every µ_t has a finite p-th moment and there exists v ∈ L¹([0, 1]) such that W_p(µ_s, µ_t) ≤ ∫_s^t v(r) dr holds for any 0 ≤ s ≤ t ≤ 1. Then the limit |µ'|_{W_p}(t) := lim_{h→0} |h|⁻¹ W_p(µ_{t+h}, µ_t) exists for almost all t ∈ [0, 1] (see Ambrosio et al. (2008), Theorem 1.1.2). Such a |µ'|_{W_p} is called the metric derivative of µ_t.

Lemma 2 (Ambrosio et al. (2008), Theorem 8.3.1). Let µ_t : [0, 1] → P(X) be a P_p(X)-absolutely continuous curve, and let |µ'|_{W_p} ∈ L¹([0, 1]) be its metric derivative.
Then there exists a vector field v : (x, t) ↦ v_t(x) ∈ R^d such that v_t ∈ L^p(µ_t; X), ‖v_t‖_{L^p(µ_t)} ≤ |µ'|_{W_p}(t) for almost all t ∈ [0, 1], and

∫_0^1 ∫ (d/dt)ϕ(x, t) µ_t(dx) dt = −∫_0^1 ∫ ⟨v_t(x), ∇_xϕ(x, t)⟩ µ_t(dx) dt

holds for any cylindrical function ϕ(x, t).

Given p > 1, let µ_t be a P_p(X)-absolutely continuous curve such that µ_0 = µ and µ_1 = ν. Then

|J(µ) − J(ν)| = |∫_0^1 d/dt J(µ_t) dt| = |∫_0^1 ∫ ⟨v_t, ∇Φ_{µ_t}⟩ dµ_t dt|
≤ ∫_0^1 ‖v_t‖_{L^p(µ_t)} ‖∇Φ_{µ_t}‖_{L^q(µ_t)} dt
≤ ∫_0^1 ‖v_t‖_{L^p(µ_t)} dt · sup_µ̃ ‖∇Φ_µ̃‖_{L^q(µ̃)}
≤ ∫_0^1 |µ'_t|_{W_p} dt · sup_µ̃ ‖∇Φ_µ̃‖_{L^q(µ̃)},

where |µ'_t|_{W_p} is the metric derivative. The existence of v_t and the bound ‖v_t‖_{L^p(µ_t)} ≤ |µ'_t|_{W_p} are due to Lemma 2. Taking the infimum over all such curves, we find that

|J(µ) − J(ν)| ≤ W_p(µ, ν) sup_µ̃ ‖∇Φ_µ̃‖_{L^q(µ̃)}.

To conclude, we consider the limit p → 1. Using the fact that q ↦ ‖f‖_{L^q(µ)} is increasing in q, we have

|J(µ) − J(ν)| ≤ W_p(µ, ν) sup_µ̃ ‖∇Φ_µ̃‖_{L^∞(µ̃)} ≤ αW_p(µ, ν).

Then

|J(µ) − J(ν)| ≤ inf_{p>1} αW_p(µ, ν) = α inf_{π∈Π(µ,ν)} inf_{p>1} ‖x − y‖_{L^p(π)} = α inf_{π∈Π(µ,ν)} ‖x − y‖_{L¹(π)} = αW₁(µ, ν).

Proposition 4. The minimax and non-saturating GAN losses do not satisfy (D1) for some µ0.

Proposition 5. The Wasserstein GAN loss satisfies (D1) with α = 1 for any µ0.

Proof of Propositions 4 and 5. The counterexample for Proposition 4 is a slight simplification of Example 1 of Arjovsky et al. (2017). Let δ_x denote the distribution that outputs x ∈ R with probability 1. Then we have

D_JS(δ_x ‖ δ_0) = log 2 if x ≠ 0 and 0 if x = 0,
D_KL(½δ_x + ½δ_0 ‖ δ_0) = ∞ if x ≠ 0 and 0 if x = 0,
and W₁(δ_x, δ_0) = |x|.

Therefore, for J(µ) = D_JS(µ ‖ δ_0), the inequality |J(δ_x) − J(δ_0)| ≤ αW₁(δ_x, δ_0) cannot hold for sufficiently small x ≠ 0, because the left-hand side equals log 2 while the right-hand side equals α|x|. For J(µ) = D_KL(½µ + ½δ_0 ‖ δ_0), the inequality does not hold for any x ≠ 0.

Next, we verify Proposition 5. For J(µ) = W₁(µ, µ0), we have

J(µ) − J(ν) = W₁(µ, µ0) − W₁(ν, µ0) ≤ W₁(µ, ν)

by the triangle inequality (see e.g. Section 6 of Villani (2009)). Reversing the roles of µ and ν, we conclude that |J(µ) − J(ν)| ≤ W₁(µ, ν). The result follows from Proposition 3.

Proposition 6 (Pasch–Hausdorff). Let J : M(X) → R be a function, and define J̃ := J □ R1. Then J̃ is α-Lipschitz w.r.t. the distance induced by the KR norm, and hence w.r.t. the Wasserstein-1 distance when restricted to P(X).

Proof. Observe that by the triangle inequality,

J̃(µ) − J̃(ν) = inf_µ̃ [J(µ̃) + αW₁(µ, µ̃)] − inf_ν̃ [J(ν̃) + αW₁(ν, ν̃)]
= sup_ν̃ inf_µ̃ [J(µ̃) + αW₁(µ, µ̃) − J(ν̃) − αW₁(ν, ν̃)]
≤ sup_ν̃ [αW₁(µ, ν̃) − αW₁(ν, ν̃)]
≤ sup_ν̃ αW₁(µ, ν) = αW₁(µ, ν).

Interchanging the roles of µ and ν completes the proof.

Lemma 3. The convex conjugate of µ ↦ α‖µ‖_KR is ϕ ↦ χ{‖ϕ‖_Lip ≤ α}.

Proof. The convex conjugate is

sup_µ [∫ f dµ − α‖µ‖_KR] = sup_µ [∫ f dµ − α sup_{‖g‖_Lip≤1} ∫ g dµ]
= sup_µ [∫ f dµ − sup_{‖g‖_Lip≤α} ∫ g dµ]
= sup_µ inf_{‖g‖_Lip≤α} [∫ f dµ − ∫ g dµ]
= inf_{‖g‖_Lip≤α} sup_µ [∫ f dµ − ∫ g dµ]
= inf_{‖g‖_Lip≤α} χ{f = g} = χ{‖f‖_Lip ≤ α},

where Sion's minimax theorem ensures that we can swap the inf and the sup.
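The contrast in Propositions 4 and 5 is easy to see numerically. The following sketch is illustrative only (the grid discretization of the point masses and the use of scipy are our own choices, not part of the proof): the Jensen–Shannon divergence between δ_x and δ_0 stays at log 2 as x → 0, while the 1-Wasserstein distance shrinks linearly, so no Lipschitz bound w.r.t. W₁ can hold for the JS-based losses.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

# Represent delta_x and delta_0 as one-hot histograms on a fixed grid.
grid = np.linspace(-1.0, 1.0, 2001)

def delta(x):
    h = np.zeros_like(grid)
    h[np.argmin(np.abs(grid - x))] = 1.0
    return h

for x in [0.5, 0.1, 0.01]:
    js = jensenshannon(delta(x), delta(0.0), base=np.e) ** 2  # squared distance = JS divergence (nats)
    w1 = wasserstein_distance([x], [0.0])                     # exact for point masses: |x|
    print(f"x = {x:5.2f}:  JS = {js:.4f} (log 2 = {np.log(2):.4f}),  W1 = {w1:.4f}")
# JS stays at log 2 while W1 -> 0, so |J(delta_x) - J(delta_0)| <= alpha * W1 fails.
```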
G PROOFS FOR SECTION 5

Proposition 7. The Wasserstein, minimax, and non-saturating GAN losses do not satisfy (D2) for some µ0.

Proof. First, we prove the statement for the Wasserstein GAN. Consider J(µ) = W₁(µ, δ_0) evaluated at µ = ½δ_{−1} + ½δ_1, a mixture of spikes at ±1. The optimal discriminator of J at this mixture is the Kantorovich potential that transfers µ to δ_0, which is Φ_µ(x) = |x|. The gradient of this Kantorovich potential is discontinuous at 0, and hence not Lipschitz.

For the minimax GAN J(µ) = D_JS(µ ‖ µ0), let µ and µ0 be probability distributions on R whose densities are p(x) ∝ exp(−|x|/2) and p0(x) ∝ exp(−|x|), respectively (i.e., Laplace distributions). Then the optimal discriminator at µ is given as

Φ_µ(x) = ½ log [p(x)/(p(x) + p0(x))] = −½ log(1 + 2 exp(−|x|/2)).

By elementary calculations, we can see that this function is not differentiable at x = 0, which implies that the minimax GAN does not satisfy (D2). For the non-saturating GAN, we obtain a similar result by swapping the roles of µ and µ0.

Proposition 8. Let J : M(X) → R be a convex, proper, lower semicontinuous function, and define J̃ := J □ R2. Then the optimal discriminator for J̃ is β₁-smooth.

Proof. Since J is convex, proper, and lower semicontinuous, so is J̃, and

J̃(µ) = (J □ R2)(µ) = sup_ϕ [ ∫ ϕ dµ − J*(ϕ) − χ{ϕ ∈ β₁S} ],

where the conjugate of R2 is computed in Lemma 4. Then, by the envelope theorem, δJ̃/δµ is the ϕ that maximizes the right-hand side, and hence is β₁-smooth.

Lemma 4. The convex conjugate of µ ↦ β₁‖µ‖_S is ϕ ↦ χ{ϕ ∈ β₁S}.

Proof. The proof of Lemma 4 is nearly identical to the proof of Lemma 3.

Proposition 9. Let f : R^d → R be a neural network consisting of k layers whose linear transformations have spectral norm 1 and whose activation functions are 1-Lipschitz and 1-smooth. Then f is k-smooth.

Proof. In this section, let ℓ(f) and s(f) denote the Lipschitz constants of f and ∇f, respectively. The inequality

s(f ∘ g) ≤ s(f)ℓ(g)² + ℓ(f)s(g)   (27)

holds, since

‖d(f ∘ g)(x) − d(f ∘ g)(y)‖ = ‖df(g(x))dg(x) − df(g(y))dg(y)‖
= ‖df(g(x))dg(x) − df(g(y))dg(x) + df(g(y))dg(x) − df(g(y))dg(y)‖
≤ ‖df(g(x)) − df(g(y))‖ ‖dg(x)‖ + ‖df(g(y))‖ ‖dg(x) − dg(y)‖
≤ s(f)‖g(x) − g(y)‖ ℓ(g) + ℓ(f)s(g)‖x − y‖
≤ (s(f)ℓ(g)² + ℓ(f)s(g)) ‖x − y‖.

Let σ be an elementwise activation function with ℓ(σ) = s(σ) = 1, and let A be a linear layer with spectral norm 1. Then ℓ(A) = 1 and s(A) = 0, so

ℓ(σ ∘ A) ≤ ℓ(σ)ℓ(A) = 1,
s(σ ∘ A) ≤ s(σ)ℓ(A)² + ℓ(σ)s(A) = 1.

We note that we can use the same value of s(σ) whether we consider σ : R → R or elementwise as σ : R^d → R^d, since for an elementwise σ we have

‖dσ(x) − dσ(y)‖₂ = ‖diag σ'(x) − diag σ'(y)‖₂ = ‖diag(σ'(x) − σ'(y))‖₂
= max_i |σ'(x_i) − σ'(y_i)| ≤ max_i s(σ)|x_i − y_i| = s(σ)‖x − y‖_∞ ≤ s(σ)‖x − y‖₂.

We apply the inequality (27) recursively to all k layers to obtain that the entire network is k-smooth.
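As a practical companion to Proposition 9, here is a sketch under our own choices (not from the paper): a 3-layer, vector-valued network with PyTorch's built-in spectral normalization and tanh activations (tanh is 1-Lipschitz and 1-smooth). It estimates the network's smoothness by comparing Jacobians at random nearby points; the estimate should stay below the bound k = 3.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, k = 16, 3   # input dimension and number of layers

# k layers: spectrally normalized linear maps followed by tanh.
layers = []
for _ in range(k):
    layers += [nn.utils.spectral_norm(nn.Linear(d, d)), nn.Tanh()]
net = nn.Sequential(*layers)

# Run a few forward passes so the power iteration inside spectral_norm converges, then freeze it.
with torch.no_grad():
    for _ in range(100):
        net(torch.randn(64, d))
net.eval()

def jac(x):
    # Jacobian of the network at x, shape (d, d).
    return torch.autograd.functional.jacobian(net, x)

# Estimate s(f) = sup ||Df(x) - Df(y)||_op / ||x - y|| over random nearby pairs.
est = 0.0
for _ in range(200):
    x = torch.randn(d)
    y = x + 0.01 * torch.randn(d)
    ratio = torch.linalg.matrix_norm(jac(x) - jac(y), ord=2) / torch.norm(x - y)
    est = max(est, ratio.item())

print(f"estimated smoothness: {est:.3f}   (Proposition 9 bound: k = {k})")
```

In practice the estimate is far below k, since the proof only tracks worst-case constants layer by layer.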
Lemma 5. Let H be an RKHS with the Gaussian kernel. Then ‖µ̂‖_H ≤ (√2/σ²)‖µ‖_S.

Proof. It follows from ∂f(x)/∂x_i = ⟨f, ∂K(x, ·)/∂x_i⟩_H that, for f ∈ H,

‖∇f(x) − ∇f(y)‖²₂ = Σ_i ⟨f, ∂K(x, ·)/∂x_i − ∂K(y, ·)/∂y_i⟩²_H ≤ ‖f‖²_H (Γ(x, y) + Γ(y, x)),

where x̃ and ỹ denote copies of x and y, and

Γ(x, y) := Σ_i [ ∂²K(x, x̃)/∂x_i∂x̃_i |_{x̃=x} − ∂²K(x, y)/∂x_i∂y_i ].   (30)

From (30), we find that for the Gaussian kernel this yields ‖∇f(x) − ∇f(y)‖₂ ≤ (√2/σ²)‖f‖_H ‖x − y‖; that is, every f in the unit ball of H is (√2/σ²)-smooth. This implies that

‖µ̂‖_H = sup_{f∈H, ‖f‖_H≤1} ∫ f dµ ≤ (√2/σ²)‖µ‖_S.

H PROOFS FOR SECTION 6

Lemma 6. The convex conjugate of µ ↦ (λ/2)‖µ‖²_KR is f ↦ (1/(2λ))‖f‖²_Lip.

Proof. This proof generalizes the finite-dimensional case (Boyd & Vandenberghe, 2004). We can bound the convex conjugate from above by

sup_µ [∫ f dµ − (λ/2)‖µ‖²_KR] ≤ sup_µ [‖f‖_Lip‖µ‖_KR − (λ/2)‖µ‖²_KR]
≤ sup_{z≥0} [‖f‖_Lip z − (λ/2)z²] = ‖f‖²_Lip/λ − (λ/2)(‖f‖_Lip/λ)² = (1/(2λ))‖f‖²_Lip,

where the supremum over z is attained at z = ‖f‖_Lip/λ.

If f is constant, then we may choose µ = 0 to see that

sup_µ [∫ f dµ − (λ/2)‖µ‖²_KR] ≥ 0 = (1/(2λ))‖f‖²_Lip,

and we are done. Otherwise, for f(x) ≠ f(y), let µ_{x,y} be the signed measure given by

µ_{x,y} = ‖f‖²_Lip / (λ(f(x) − f(y))) · (δ_x − δ_y).

Then

‖µ_{x,y}‖_KR = ‖f‖²_Lip / (λ(f(x) − f(y))) · sup_{‖g‖_Lip≤1} [g(x) − g(y)] = ‖f‖²_Lip / (λ(f(x) − f(y))) · d(x, y)

and

∫ f dµ_{x,y} = ‖f‖²_Lip / (λ(f(x) − f(y))) · (f(x) − f(y)) = (1/λ)‖f‖²_Lip.

Therefore

sup_µ [∫ f dµ − (λ/2)‖µ‖²_KR] ≥ sup_{f(x)≠f(y)} [∫ f dµ_{x,y} − (λ/2)‖µ_{x,y}‖²_KR]
= sup_{f(x)≠f(y)} [ (1/λ)‖f‖²_Lip − (‖f‖⁴_Lip/(2λ)) (d(x, y)/(f(x) − f(y)))² ]
= (1/λ)‖f‖²_Lip − (‖f‖⁴_Lip/(2λ)) (1/‖f‖_Lip)²
= (1/(2λ))‖f‖²_Lip,

since sup_{f(x)≠f(y)} (f(x) − f(y))/d(x, y) = ‖f‖_Lip.

Proposition 10. Let J : M(X) → R be a convex function. Then J satisfies (D3) if and only if D_J(ν, µ) ≤ (β₂/2)‖µ − ν‖²_KR for all µ, ν ∈ M(X).

Proof. Recall the definition of the Bregman divergence:

D_J(ν, µ) := J(ν) − J(µ) − ∫ Φ_µ(x) d(ν − µ).

Proposition 10 claims that the optimal discriminator Φ_µ is β₂-smooth as a function of µ if and only if, for all µ, ν ∈ M(X),

J(ν) − J(µ) − ∫ Φ_µ(x) d(ν − µ) ≤ (β₂/2)‖µ − ν‖²_KR.   (28)

This proof generalizes the finite-dimensional case (Zhou, 2018; Sidford, 2017). Suppose that ‖∇Φ_µ(x) − ∇Φ_ν(x)‖₂ ≤ β₂‖µ − ν‖_KR holds for all x, i.e., ‖Φ_µ − Φ_ν‖_Lip ≤ β₂‖µ − ν‖_KR. Let µ_t (0 ≤ t ≤ 1) be the mixture µ_t = (1 − t)µ + tν, so that µ_0 = µ, µ_1 = ν, and ‖µ_t − µ_s‖_KR = |t − s|‖µ − ν‖_KR. Then we have

|D_J(ν, µ)| = |∫_0^1 d/dt J(µ_t) dt − ∫ Φ_µ d(ν − µ)|
= |∫_0^1 ∫ Φ_{µ_t} d(ν − µ) dt − ∫ Φ_µ d(ν − µ)|
= |∫_0^1 ∫ (Φ_{µ_t} − Φ_µ) d(ν − µ) dt|
≤ ∫_0^1 ‖Φ_{µ_t} − Φ_µ‖_Lip ‖µ − ν‖_KR dt
≤ ∫_0^1 β₂‖µ_t − µ‖_KR ‖µ − ν‖_KR dt
= ∫_0^1 tβ₂‖µ − ν‖²_KR dt = (β₂/2)‖µ − ν‖²_KR.

Since the choice of µ and ν is arbitrary, we have proved (28) assuming (D3).

Next, we prove the converse. Before proceeding, note that the Bregman divergence D_J(ν, µ) is always non-negative. In fact, for any µ, ν ∈ M(X), we have

∫ Φ_µ d(ν − µ) = [∫ Φ_µ dν − J*(Φ_µ)] − [∫ Φ_µ dµ − J*(Φ_µ)] ≤ J(ν) − J(µ),

since the first bracketed term is at most J(ν) and the second equals J(µ), which implies D_J(ν, µ) ≥ 0.

Choose ξ ∈ M(X) arbitrarily. If ξ = µ, the inequality ‖∇Φ_µ − ∇Φ_ξ‖₂ ≤ β₂‖ξ − µ‖_KR is trivial, so we assume ξ ≠ µ below. By the non-negativity of the Bregman divergence and (28) applied to the pair (ν, ξ), we have

0 ≤ J(ν) − J(µ) − ∫ Φ_µ d(ν − µ)
≤ J(ξ) + ∫ Φ_ξ d(ν − ξ) + (β₂/2)‖ξ − ν‖²_KR − J(µ) − ∫ Φ_µ d(ν − µ)
= J(ξ) − J(µ) − ∫ Φ_µ d(ξ − µ) + ∫ [Φ_ξ − Φ_µ] d(ν − ξ) + (β₂/2)‖ξ − ν‖²_KR.   (29)

Since the above inequality holds for all ν ∈ M(X), the last expression remains non-negative if we take the infimum over all ν. In particular,

inf_ν [ ∫ [Φ_ξ − Φ_µ] d(ν − ξ) + (β₂/2)‖ξ − ν‖²_KR ]
= −sup_{ν∈M(X)} [ ∫ [Φ_µ − Φ_ξ] d(ν − ξ) − (β₂/2)‖ξ − ν‖²_KR ]
= −sup_{ζ∈M(X)} [ ∫ [Φ_µ − Φ_ξ] dζ − (β₂/2)‖ζ‖²_KR ]
= −(1/(2β₂))‖Φ_µ − Φ_ξ‖²_Lip,

using Lemma 6 for the last equality. Continuing from (29), we have

0 ≤ J(ξ) − J(µ) − ∫ Φ_µ d(ξ − µ) − (1/(2β₂))‖Φ_µ − Φ_ξ‖²_Lip.

Swapping the roles of µ and ξ, we obtain a similar inequality. Adding both sides of the two inequalities thus obtained, we obtain

0 ≤ ∫ [Φ_ξ − Φ_µ] d(ξ − µ) − (1/β₂)‖Φ_µ − Φ_ξ‖²_Lip.

Finally, we have

‖Φ_µ − Φ_ξ‖²_Lip ≤ β₂ ∫ [Φ_ξ − Φ_µ] d(ξ − µ) ≤ β₂‖Φ_µ − Φ_ξ‖_Lip ‖ξ − µ‖_KR,

and hence we have the desired result: ‖Φ_µ − Φ_ξ‖_Lip ≤ β₂‖ξ − µ‖_KR.
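Before the next lemma, here is a quick numerical sanity check of the bound it proves, MMD²_K(µ, ν) ≤ (1/σ²)‖µ − ν‖²_KR. This is a sketch under our own choices (1-D Gaussian samples, σ = 1, and scipy's exact 1-D Wasserstein distance standing in for the KR norm of the difference of two probability measures).

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
sigma = 1.0

def mmd2(x, y, sigma):
    """Squared MMD with kernel exp(-|x-y|^2 / (2 sigma^2)) between the
    empirical measures of 1-D samples x and y."""
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))
    return k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()

for _ in range(5):
    x = rng.normal(0.0, 1.0, size=200)
    y = rng.normal(0.5, 1.5, size=200)
    lhs = mmd2(x, y, sigma)
    w1 = wasserstein_distance(x, y)   # equals ||mu - nu||_KR for probability measures
    print(f"MMD^2 = {lhs:.4f}  <=  W1^2 / sigma^2 = {w1**2 / sigma**2:.4f}")
```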
Lemma 7. Let K(x, y) = exp(−‖x − y‖²/(2σ²)). Then MMD²_K(µ, ν) ≤ (1/σ²)‖µ − ν‖²_KR.

Proof. First, we note that

MMD²_K(µ, ν) = ∫∫ K(x, y) (µ − ν)(dx)(µ − ν)(dy)
= ∫ [ ∫ K(x, y) (µ − ν)(dy) ] (µ − ν)(dx)
≤ ‖µ − ν‖_KR · sup_{x≠x'} (1/d(x, x')) | ∫ (K(x, y) − K(x', y)) (µ − ν)(dy) |
≤ ‖µ − ν‖²_KR · sup_{x≠x', y≠y'} [K(x, y) − K(x, y') − K(x', y) + K(x', y')] / (d(x, x') d(y, y')).

Next, we show that

c := sup_{x≠x', y≠y'} [K(x, y) − K(x, y') − K(x', y) + K(x', y')] / (d(x, x') d(y, y')) = 1/σ².

Because K is differentiable, it suffices to check that the operator norm of ∇_x∇_y K(x, y) is bounded. To see this, note that

c = sup_{x≠x'} ‖K(x, ·) − K(x', ·)‖_Lip / d(x, x') = sup_{x≠x', y} ‖∇_y K(x, y) − ∇_y K(x', y)‖₂ / d(x, x') ≤ sup_{x,y} ‖∇_x∇_y K(x, y)‖_op,

where we obtained the last inequality by applying the vector-valued mean value theorem (Rudin, 1964) to t ↦ ∇_y K((1 − t)x + tx', y), thereby obtaining

‖∇_y K(x, y) − ∇_y K(x', y)‖₂ ≤ sup_t ‖(d/dt) ∇_y K((1 − t)x + tx', y)‖₂ = sup_t ‖(x' − x) · ∇_x∇_y K((1 − t)x + tx', y)‖₂ ≤ ‖x' − x‖₂ sup_t ‖∇_x∇_y K((1 − t)x + tx', y)‖_op.

For the Gaussian kernel,

∂²K(x, y)/∂x_i∂y_j = ∂²/∂x_i∂y_j exp(−(1/(2σ²)) Σ_k (x_k − y_k)²) = exp(−(1/(2σ²)) Σ_k (x_k − y_k)²) [ δ_{ij}/σ² − (x_i − y_i)(x_j − y_j)/σ⁴ ],

i.e., ∇_x∇_y K(x, y) is the symmetric matrix −(1/σ²) exp(−‖x − y‖²/(2σ²)) [ (1/σ²)(x − y)(x − y)^T − I ].

By inspection, we see that x − y is an eigenvector with eigenvalue −(1/σ²) exp(−‖x − y‖²/(2σ²)) (‖x − y‖²/σ² − 1), and the other eigenvectors are orthogonal to x − y, with eigenvalues (1/σ²) exp(−‖x − y‖²/(2σ²)). Setting z = ‖x − y‖²/(2σ²) and taking the absolute values of the eigenvalues, we see that the maximum operator norm over all x, y equals

max{ sup_{z≥0} (1/σ²) e^{−z} |2z − 1|, sup_{z≥0} (1/σ²) e^{−z} } = 1/σ².

Therefore c ≤ 1/σ². It is not strictly necessary for the proof to show equality. However, to show equality, let u be a unit vector, and set x = y = 0 and x' = y' = tu. Then

[K(tu, tu) − K(tu, 0) − K(0, tu) + K(0, 0)] / (d(tu, 0) d(tu, 0)) = (2 − 2 exp(−t²/(2σ²))) / t²,

and

lim_{t→0} (2 − 2 exp(−t²/(2σ²))) / t² = lim_{t→0} (2t/σ²) exp(−t²/(2σ²)) / (2t) = 1/σ²,

where we used l'Hôpital's rule to evaluate the limit.

Proposition 11. The minimax and non-saturating GAN losses do not satisfy (D3) for some µ0.

Proof. For the minimax loss, the Bregman divergence is

D_{D_JS(·‖µ0)}(ν, µ) = D_JS(ν ‖ µ0) − D_JS(µ ‖ µ0) − ½ ∫ log[µ/(µ0 + µ)] d(ν − µ)
= ½ ∫ log[(µ0 + µ)/(µ0 + ν)] d(µ0 + ν) + ½ D_KL(ν ‖ µ),

and for the non-saturating loss, the Bregman divergence is

D_{D_KL(½·+½µ0 ‖ µ0)}(ν, µ) = D_KL(½ν + ½µ0 ‖ µ0) − D_KL(½µ + ½µ0 ‖ µ0) − ½ ∫ log[(µ0 + µ)/(2µ0)] d(ν − µ)
= ½ ∫ log[(µ0 + ν)/(µ0 + µ)] d(µ0 + ν).

Choosing ν to be not absolutely continuous w.r.t. ½µ + ½µ0 makes the Bregman divergence infinite, which is sufficient to show that the Bregman divergence is not bounded by (β₂/2)‖µ − ν‖²_KR.

Proposition 12. The MMD loss with Gaussian kernel satisfies (D3) with β₂ = 2π for all µ0.

Proof. We work with the Gaussian kernel K(x, y) = e^{−‖x−y‖²/(2σ²)} and use Proposition 10:

D_{½MMD²(·, µ0)}(ν, µ)
= ½ ∫∫ K(x, y) (ν − µ0)(dx)(ν − µ0)(dy) − ½ ∫∫ K(x, y) (µ − µ0)(dx)(µ − µ0)(dy) − ∫ [ ∫ K(x, y) µ(dy) − ∫ K(x, y) µ0(dy) ] (ν − µ)(dx)
= ½ ∫∫ K(x, y) (ν − µ)(dx)(ν − µ)(dy)
= ½ MMD²(µ, ν) ≤ (1/(2σ²)) ‖µ − ν‖²_KR,

where the last line is from Lemma 7. The result follows with σ = (2π)^{−1/2}. The result actually applies more generally to the MMD loss with any differentiable kernel K that satisfies β₂ := sup_{x,y} ‖∇_x∇_y K(x, y)‖₂ < ∞.

Proposition 13 (Moreau–Yosida regularization). Suppose J : M(X) → R is convex, and define J̃ := J □ R3. Then J̃ is convex, and D_J̃(ν, µ) ≤ (β₂/2)‖µ − ν‖²_KR.

Proof. We work in the more general case of K(x, y) = (2πσ²)^{−d/2} exp(−‖x − y‖²/(2σ²)), and define

J̃(µ) = inf_µ̃ [ J(µ̃) + (cβ₂/2) MMD²_K(µ, µ̃) ]   with c = σ²(2πσ²)^{d/2}.

Let µ* be the unique minimizer of the infimum, which exists because the objective is strongly convex. By the envelope theorem, we compute that

d/dϵ J̃(µ + ϵχ)|_{ϵ=0} = d/dϵ (cβ₂/2)‖µ + ϵχ − µ*‖²_H |_{ϵ=0}
= d/dϵ (cβ₂/2)[ ‖µ − µ*‖²_H + 2ϵ⟨µ − µ*, χ⟩_H + ϵ²‖χ‖²_H ] |_{ϵ=0}
= cβ₂ ⟨µ − µ*, χ⟩_H = cβ₂ ∫ (µ − µ*)^ dχ,

so δJ̃/δµ = cβ₂ (µ − µ*)^, the kernel embedding of µ − µ*. Then

D_J̃(ν, µ) = J̃(ν) − J̃(µ) − ∫ (δJ̃/δµ) d(ν − µ)
= inf_ν̃ [ J(ν̃) + (cβ₂/2)‖ν − ν̃‖²_H ] − [ J(µ*) + (cβ₂/2)‖µ − µ*‖²_H ] − cβ₂⟨µ − µ*, ν − µ⟩_H
≤ J(µ*) + (cβ₂/2)‖ν − µ*‖²_H − [ J(µ*) + (cβ₂/2)‖µ − µ*‖²_H ] − cβ₂⟨µ − µ*, ν − µ⟩_H
= (cβ₂/2)‖µ − ν‖²_H
≤ (cβ₂/2)(2πσ²)^{−d/2}(1/σ²)‖µ − ν‖²_KR = (β₂/2)‖µ − ν‖²_KR,

where we used Lemma 7 for the second-to-last line. The proposition follows for σ = (2π)^{−1/2}.

Regarding this choice of σ: it will turn out that, in the general case, the dual penalty includes a numerically unfavorable factor of (2πσ²)^{−d/2}, which depends on the dimension of the problem. In practical applications, such as image generation, d can be quite large, making the accurate computation of (2πσ²)^{−d/2} completely infeasible. For numerical stability, we propose choosing the critical parameter σ = (2π)^{−1/2}, which corresponds to the dimension-free kernel K(x, y) = e^{−π‖x−y‖²}.
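To see concretely why the factor (2πσ²)^{−d/2} is numerically troublesome, consider the following back-of-the-envelope sketch (the image dimension d = 3·32·32 is an illustrative choice of ours, not a value from the paper): unless σ² = 1/(2π), the prefactor overflows or underflows double precision, whereas the proposed kernel e^{−π‖x−y‖²} carries no such factor.

```python
import numpy as np

d = 3 * 32 * 32   # e.g. a 32x32 RGB image (illustrative)
for sigma2 in [0.1, 0.5, 1.0, 1.0 / (2 * np.pi)]:
    log_prefactor = -0.5 * d * np.log(2 * np.pi * sigma2)
    with np.errstate(over="ignore", under="ignore"):
        prefactor = np.exp(log_prefactor)   # inf or 0.0 except at sigma^2 = 1/(2 pi)
    print(f"sigma^2 = {sigma2:7.4f}:  log prefactor = {log_prefactor:9.1f}   (2 pi sigma^2)^(-d/2) = {prefactor}")
# Only sigma^2 = 1/(2 pi), i.e. the kernel exp(-pi ||x - y||^2), keeps the factor at 1.
```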
Lemma 8. Let H be an RKHS with a continuous kernel K on a compact domain X. The convex conjugate of µ ↦ (λ/2)‖µ̂‖²_H is f ↦ (1/(2λ))‖f‖²_H + χ{f ∈ H}.

Proof. First we show an upper bound for f ∈ H:

sup_µ [∫ f dµ − (λ/2)‖µ̂‖²_H] ≤ sup_µ [‖f‖_H ‖µ̂‖_H − (λ/2)‖µ̂‖²_H]
≤ sup_{z≥0} [‖f‖_H z − (λ/2)z²] = ‖f‖²_H/λ − (λ/2)(‖f‖_H/λ)² = (1/(2λ))‖f‖²_H,

where the supremum over z is attained at z = ‖f‖_H/λ.

To derive a lower bound for a general f ∈ C(X), we use the Mercer decomposition of the positive definite kernel K,

K(x, y) = Σ_{i=1}^∞ γ_i φ_i(x) φ_i(y),

where γ_i ≥ 0 are the eigenvalues and {φ_i}_{i=1}^∞ is a complete orthonormal sequence of L²(X; dx). It is well known that the corresponding RKHS is given by

H = { g ∈ L²(X, dx) : Σ_i (g, φ_i)²_{L²(X)} / γ_i < ∞ },

with the norm of g ∈ H given by ‖g‖²_H = Σ_i (g, φ_i)²_{L²(X)} / γ_i.

Let f ∈ C(X) ⊂ L²(X, dx) be an arbitrary function. If we restrict the supremum in the conjugate function

sup_µ [∫ f dµ − (λ/2)‖µ̂‖²_H]   (31)

to those µ ∈ M(X) that have a square-integrable Radon–Nikodym derivative with respect to the Lebesgue measure dx, we obtain a lower bound. With slight abuse of notation, we write µ(x) for the Radon–Nikodym derivative. Suppose f(x) = Σ_{j=1}^∞ a_j φ_j(x) and µ(x) = Σ_{j=1}^∞ b_j φ_j(x) are the corresponding expansions. Note that, since the kernel embedding is given by

µ̂(x) = ∫ K(x, y) µ(y) dy = Σ_{j=1}^∞ γ_j b_j φ_j(x),

the above maximization is reduced to

sup_{(b_j)} Σ_j [ a_j b_j − (λ/2) γ_j b_j² ].

This is maximized when b_j = a_j/(λγ_j), and the maximum value is

(1/(2λ)) Σ_j a_j²/γ_j.   (32)

If f ∈ H, this value is finite, and the resulting lower bound on the conjugate is (1/(2λ))‖f‖²_H, which is the same as the upper bound. If f ∉ H, the value (32) is infinite, and the conjugate function takes the value +∞.

Proposition 14. Let H be an RKHS with the Gaussian kernel K(x, y) = e^{−π‖x−y‖²}. Then, for f ∈ H,

‖f‖²_H = Σ_{k=0}^∞ (4π)^{−k} Σ_{k1+···+kd=k} (1/(k1!···kd!)) ‖∂^{k1}_{x1}···∂^{kd}_{xd} f‖²_{L²(R^d)}   (21)
= ‖f‖²_{L²(R^d)} + (1/(4π))‖∇f‖²_{L²(R^d)} + (1/(16π²))‖∇²f‖²_{L²(R^d)} + other terms.   (22)

Proof. We consider the more general case of K(x, y) = (2πσ²)^{−d/2} e^{−‖x−y‖²/(2σ²)}, for which

‖f‖²_H = Σ_{k=0}^∞ (σ²/2)^k Σ_{k1+···+kd=k} (1/(k1!···kd!)) ‖∂^{k1}_{x1}···∂^{kd}_{xd} f‖²_{L²(R^d)}
= ‖f‖²_{L²(R^d)} + (σ²/2)‖∇f‖²_{L²(R^d)} + (σ⁴/4)‖∇²f‖²_{L²(R^d)} + other terms.

Novak et al. (2018) prove this result for σ = 1. We sketch the proof for the general case here. We use the Fourier transform convention

f̂(k) = (2π)^{−d/2} ∫_{R^d} f(x) e^{−ik·x} dx,   f(x) = (2π)^{−d/2} ∫_{R^d} f̂(k) e^{ik·x} dk.

Consider the inner product

⟨f, g⟩ := ∫_{R^d} e^{σ²‖k‖²/2} f̂(k) ĝ(k)* dk,

defined for functions with ‖f‖ < ∞. Expanding the exponential in a Taylor series, this inner product gives the equation for the norm in the proposition. By the Fourier inversion formula, it can be shown that ⟨f, K_x⟩ = f(x), where K_x(y) = (2πσ²)^{−d/2} e^{−‖x−y‖²/(2σ²)}, so this is an RKHS with the Gaussian kernel.