# eulerlagrange_analysis_of_generative_adversarial_networks__13fc9d49.pdf

Journal of Machine Learning Research 24 (2023) 1-100 Submitted 12/20; Revised 2/23; Published 4/23

Euler-Lagrange Analysis of Generative Adversarial Networks

Siddarth Asokan siddartha@iisc.ac.in Robert Bosch Centre for Cyber-Physical Systems Indian Institute of Science Bengaluru-560012, India

Chandra Sekhar Seelamantula css@iisc.ac.in Department of Electrical Engineering Indian Institute of Science Bengaluru-560012, India

Editor: Shakir Mohamed

We consider Generative Adversarial Networks (GANs) and address the underlying functional optimization problem ab initio within a variational setting. Strictly speaking, the optimization of the generator and discriminator functions must be carried out in accordance with the Euler-Lagrange conditions, which become particularly relevant in scenarios where the optimization cost involves regularizers comprising the derivatives of these functions. Considering Wasserstein GANs (WGAN) with a gradient-norm penalty, we show that the optimal discriminator is the solution to a Poisson diﬀerential equation. In principle, the optimal discriminator can be obtained in closed form without having to train a neural network. We illustrate this by employing a Fourier-series approximation to solve the Poisson diﬀerential equation. Experimental results based on synthesized Gaussian data demonstrate superior convergence behavior of the proposed approach in comparison with the baseline WGAN variants that employ weight-clipping, gradient or Lipschitz penalties on the discriminator on low-dimensional data. We also analyze the truncation error of the Fourier-series approximation and the estimation error of the Fourier coeﬃcients in a high-dimensional setting. We demonstrate applications to real-world images considering latent-space prior matching in Wasserstein autoencoders and present performance comparisons on benchmark datasets such as MNIST, SVHN, Celeb A, CIFAR-10, and Ukiyo-E. We demonstrate that the proposed approach achieves comparable reconstruction error and Fr echet inception distance with faster convergence and up to two-fold improvement in image sharpness. Keywords: Generative adversarial networks, Calculus of variations, Euler-Lagrange conditions, Fourier-series approximation, Wasserstein autoencoder.

1. Introduction

The optimization of a generative adversarial network (GAN), originally proposed by Goodfellow et al. (2014), (Standard GAN, or SGAN) is a min-max game between two players a generator (G) and a discriminator (D). The role of the generator is to create fake samples that mimic the ones coming from the training data distribution. The discriminator D is tasked with telling apart the real samples from the fake ones. The optimal G is the one that outsmarts D into confusing the fake samples for real. The SGAN optimization comprises the generator and the discriminator with respective loss functions. The generator

c 2023 Siddarth Asokan and Chandra Sekhar Seelamantula.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/20-1390.html.

Asokan and Seelamantula

G accepts high-dimensional noise z pℓas input and generates fake samples G(z) pg. The discriminator D accepts an input x, which could come from either the data distribution pd, or the generator distribution pg, and outputs a value D(x). Eﬀectively, the generator must learn a mapping from the noise distribution to the data distribution, whereas the discriminator must learn the optimal two-class classiﬁer. Over the past decade, numerous variants of GANs have been proposed with several successful applications. Almost all known GAN ﬂavors minimize either a divergence metric or an integral probability metric. In the following, we review important GANs under each category. Divergence-minimizing GANs: GANs were originally designed to minimize the divergence between the true data distribution pd and the generator distribution pg. For instance, the SGAN formulation minimizes the Jensen-Shannon divergence, whereas the least-squares GAN (LSGAN) (Mao et al., 2017) is optimized for the Pearson-χ2 divergence. The f-GAN formulation (Nowozin et al., 2016) is a generalization that includes a chosen f-divergence. In the original SGAN, the discriminator output is constrained to the interval [0, 1], representing the probability of the input sample being real or fake, whereas LSGAN requires the discriminator output to match the chosen class-labels. In f-GANs, the discriminator output is real-valued, but the activation function maps it to a desired interval. Integral probability metric GANs: In certain GAN ﬂavors, the discriminator is replaced with a real-valued critic C that diﬀerentiates between the generator and data distributions in terms of an integral probability metric (IPM) deﬁned over the class of critics to choose from. The choice of the class of critics gives rise to variants such as the Wasserstein GAN (WGAN) (Arjovsky et al., 2017) with a Lipschitz-1 critic, the minimum-mean discrepancy GAN (MMD-GAN) (Li et al., 2017) where the critic is bounded by a ball in a reproducingkernel Hilbert space, or the Fisher GAN (Mroueh and Sercu, 2017) in which the second-order moments of the critic are constrained to be bounded. Sobolev GANs (Mroueh et al., 2018) favor critics with a ﬁnite energy in the gradient. The critic is a neural network similar to the discriminator, where the constraints are enforced appropriately, either by means of an adjustment of the network weights (Arjovsky et al., 2017; Roth et al., 2019; Wang and Liu, 2016), or through a suitable penalty incorporated into the loss function (Gulrajani et al., 2017; Roth et al., 2017; Mescheder et al., 2018). In our formulation, we use the term discriminator D(x) to refer to either a divergence-based discriminator or the IPM-based critic with the context resolving any ambiguity. In this paper, we will primarily focus on the IPM loss as considered in the context of WGANs. Within this framework, the regularized optimization problem takes the form

min pg max D Ex pd[D(x)] Ex pg[D(x)] + Ω(D(x)) ,

where E denotes the expectation operator and Ωis a suitable regularizer on D. The objective of the min-max optimization is to ensure that the optimal generator distribution p g(x) matches the data distribution pd(x). Typically, one considers gradient-based regularizers, which enforce smoothness on the discriminator. The stupendous success of GANs in generating realistic images has resulted in signiﬁcant eﬀorts trying to explain them analytically. While divergence-minimizing GANs are analyzed in a probabilistic setting, IPM based GANs have been analyzed within the framework of optimal transport (Sanjabi et al., 2018; Bousquet et al., 2017; Lei et al., 2019). The energy

Euler-Lagrange Analysis of GANs

based GAN (EBGAN) (Zhao et al., 2017) and related works (Finn et al., 2016; Che et al., 2020) interpret the GAN loss as an energy function that assigns large values to regions of the manifold where there is a mismatch between the generator and data distributions. The minmax game could also be viewed as an instance of imitation learning (Finn et al., 2016; Ho and Ermon, 2016; Grnarova et al., 2018), drawing parallels to reinforcement and online learning paradigms. GANs have also been analyzed in a game-theoretic setting with convergence to the Nash equilibrium (Oliehoek et al., 2019; Fedus et al., 2018; Gao and Tembine, 2018), and in an information-theoretic setting, for instance, in the context of entropy minimization for interpolation in the latent space (Chen et al., 2016), or stochastic procedures to estimate ratios of densities and functions thereof (Mohamed and Lakshminarayanan, 2016). The convergence guarantees of various GAN training algorithms have been analyzed in a series of contributions (Salimans et al., 2016; Gulrajani et al., 2017; Roth et al., 2017; Kodali et al., 2017; Mescheder et al., 2018; Li et al., 2018).

On GANs and Partial Diﬀerential Equations: GANs have been employed to solve stochastic diﬀerential equations and various classical partial diﬀerential equations (PDEs) encountered in the context of harmonic oscillation, nonlinear oscillation (Yang et al., 2019), and more recently, infectious disease modelling (Randle et al., 2020), to name a few. On the ﬂip side, ordinary diﬀerential equations (ODEs) have been used for GAN training. More speciﬁcally, the stability of GANs has been shown to improve when the discrete gradientdescent based network updates were replaced with a numerical solver for the corresponding ODE (Qin et al., 2020). The discriminator and generator are ﬁrst parametrized as neural networks and subsequently solved for via an ODE. In contrast with these approaches, we view the GAN training problem as a regularized functional optimization problem, which cannot always be addressed using point-wise optimization. The argument becomes more compelling when gradient-based penalties are incorporated. The proposed variational approach is generic and subsumes the unregularized formulations, for instance, the original GAN formulation of Goodfellow et al. (2014). The optimization requires one to solve a PDE. This formalism yields new insights into the interplay between the generator and the discriminator. The Sobolev GAN formulation (Mroueh et al., 2018), which employs the IPM comes closest to our approach. Preliminary connections between the Sobolev GAN and the Fokker-Planck PDE were established in Mroueh et al. (2018). However, an in-depth analysis of the PDE was not carried out, and the connection was not leveraged to optimize the GAN more eﬃciently. Sobolev GAN implementation ultimately relies on an empirical approach for computing the gradient-based penalty in the optimization (Gulrajani et al., 2017). We show how the PDE connection can be leveraged to make the GAN optimization more eﬃcient and insightful.

1.1 Our Contributions

In this paper, we analyze GANs within a variational framework by enforcing the Euler Lagrange (EL) conditions to determine the optimum. In scenarios where the GAN loss does not involve derivative terms, the EL conditions degenerate to performing point-wise optimization. Unlike the existing results in the GAN literature, we explicitly enforce essential conditions, namely non-negativity and area under the density equal to unity, which the optimal generator distribution must satisfy in order to qualify as a valid density. We carry out the analysis for several GAN ﬂavors: SGAN, LSGAN, and f-GANs (Appendix A).

Asokan and Seelamantula

To concretely demonstrate the importance and eﬃcacy of the Euler-Lagrange variational framework, we consider the Wasserstein GAN loss proposed by Arjovsky et al. (2017), but with a diﬀerence we consider a gradient-norm penalty on the discriminator, which is a novel variant of the penalty proposed by Mroueh et al. (2018). The penalized Wasserstein GAN loss necessitates Euler-Lagrange analysis because the new optimization objective involves gradient terms. The chosen gradient-norm penalization has an interesting consequence the optimal discriminator, given the generator, turns out to be the solution to the Poisson PDE. In principle, the optimal discriminator could be determined in a single shot. By analyzing the PDE, ﬁrst in 1-D, and subsequently, generalizing to higher dimensions, we show that the corresponding optimal generator, given the optimal discriminator, learns the desired target distribution (Sections 3 - 4). Our formulation also allows one to determine the optimal Lagrange multiplier in the gradient-norm-penalized Wasserstein GAN loss. Since the GAN optimization alternates between the discriminator and generator, one could start with an initial generator distribution and determine the corresponding optimal discriminator single-shot by solving the Poisson PDE. In the next step, we solve for the generator distribution using the optimal discriminator. Our analysis also shows that the optimal generator distribution coincides with the data distribution. In a real-world scenario, we only have discrete data, and solving the PDE becomes impractical. In such circumstances, we resort to an approximate solution by considering a Fourier-series representation of the discriminator. The choice of the Fourier bases is motivated by the fact that they are eigenfunctions of linear diﬀerential operators. Also, the separability and orthogonality properties of the multidimensional Fourier bases give rise to a computationally elegant approach to ﬁnding the solution, which obviates the need to optimize a neural network. The underlying formulation remains continuous, while the computations are carried out in discrete. We use the Fourier-series approximation mainly to illustrate the point and to serve as a proof of concept, although in principle, one could employ alternative and possibly more parsimonious bases expansions. In order to substantiate the developments, we provide experimental validations employing simulated unimodal and multimodal Gaussian data. Since the Fourier-series model complexity increases exponentially with the dimension of data, we resort to truncation and sampling of the Fourier coeﬃcients. The superior performance of the Fourier-series approximations in lower dimensions motivates us to consider the Wasserstein autoencoder (WAE), where we replace the neural network discriminator with a Fourier-series solver operating in the latent space (Section 6). We present results obtained by training the WAE on several datasets such as MNIST (Le Cun et al., 1998), SVHN (Netzer et al., 2011), Celeb A (Liu et al., 2015), Ukiyo-E (Pinkney and Adler, 2020), and CIFAR-10 (Krizhevsky, 2009). The experiments demonstrate that the Fourier-series based discriminator leads to a faster and stabler convergence of the GAN component in WAE measured in terms of the Fr echet inception distance (FID) (Heusel et al., 2017) and reconstruction error, while also generating substantially sharper images on the datasets considered. The notable aspect is that these advantages accrue even without having to train a discriminator neural network. While our objective is not to avoid training a neural network for the discriminator, the proposed approach gives valuable insights into how closely coupled the discriminator and the generator optimization are, and gives us a deeper understanding of what exactly the neural-network based discriminator is trying to achieve.

Euler-Lagrange Analysis of GANs

The contributions of this paper may be summarized as follows. In the context of gradientnorm penalized WGAN, we show that the optimal discriminator, given the generator, solves a Poisson PDE. The solution relates to potential functions between the generator and data distributions in high dimensions. The solution is obtained using a truncated Fourier-series model, whose coeﬃcients are obtained in closed form. This readily allows one to determine the optimal discriminator given the generator, while training only the generator network. We also show that the optimal value of the Lagrange multiplier can also be computed in closed form using a primal-dual approach. The advantage is that tracking the optimal Lagrange multiplier becomes a viable alternative for measuring training convergence in practical settings. We derive bounds on the errors introduced by the Fourier series truncation and sample estimation. Experimental validations on synthetic Gaussian and real-world datasets show that training a GAN with the proposed Fourier-series based discriminator outperforms baseline methods that consider a neural network for the discriminator. The applicability of the proposed framework is demonstrated for variants of divergence-minimizing GAN losses, with and without regularizers.

2. Mathematical Preliminaries

The cornerstone of our analysis is the Euler-Lagrange (EL) framework, which is at the heart of Calculus of Variations (Gelfand and Fomin, 1964; Mesterton-Gibbons, 2009). The EL conditions are of fundamental importance in solving several problems in physics (Goldstine, 1980; Ferguson, 2004). Consider the functional optimization of a cost L deﬁned as

L y(x), y (x) =

a F x, y(x), y (x) dx, (1)

with respect to y(x), x [a, b], which is assumed to be continuously diﬀerentiable or at least continuous with a piecewise-smooth derivative y (x), with ﬁnite Dirichlet boundary conditions. Let y (x) denote the optimizer of L. The ﬁrst variation of L at the optimum y ,

is deﬁned as the Gateaux derivative δL(y , η) = Lϵ(y )

ϵ=0 , where

Lϵ(y ) = L y (x) + ϵ η(x), y (x) + ϵ η (x) =

a F x, y (x) + ϵ η(x), y (x) + ϵ η (x) dx,

where, in turn, η(x) is a family of compactly supported, inﬁnitely diﬀerentiable functions that are identically zero at the boundaries x = a and x = b. Setting the ﬁrst variation to zero and invoking the fundamental lemma of Calculus of Variations gives rise to the Euler-Lagrange condition. The fundamental lemma of Calculus of Variations states that if a function f(x) satisﬁes the condition

a f(x) η(x) dx = 0

Asokan and Seelamantula

for all compactly supported, inﬁnitely diﬀerentiable functions η(x), then f must be identically zero almost everywhere in [a, b]. The Euler-Lagrange condition that the optimizer y (x) must satisfy is given as follows:

y=y (x) = 0. (2)

In the special case where the cost L does not involve the derivative of y, the EL condition reduces to the degenerate version:

y=y (x) = 0,

which simply corresponds to a point-wise optimization of y over x [a, b]. In the multivariate case, that is, x Rn, the cost is of the type

L y (x) , y i n i=1 = Z

X Rn F x, y, y i n i=1 dx,

where X is the domain of integration and y i denotes the partial derivative of y(x) w.r.t. the ith entry of x, that is, xi. The corresponding EL condition is

y=y (x) = 0. (3)

The EL condition is a ﬁrst-order condition and enforcing it yields the optimum. Whether the optimum corresponds to a minimizer or maximizer of the cost must be checked by invoking the second-order condition, more speciﬁcally the Legendre-Clebsch necessary condition for a

minimizer. In the 1-D case, the condition is given by 2F

y 2 0. In the multivariate setting,

this condition translates to the positive-semi-deﬁniteness (p.s.d.) of the Hessian matrix H of the Hamiltonian H, computed with respect to {y i(x)}n i=1 and evaluated at y(x) = y (x):

y=y 0, where denotes the p.s.d. property. The Hamiltonian is given by

and the entries of the Hessian are given by

[Hy,H]i,j = 2H y i y j .

We now apply the EL conditions to analyze Wasserstein GANs (WGANs) subject to the gradient-norm penalty in Section 3, and present similar analysis for divergence minimizing f-GAN variants in Appendix A and other gradient-regularized GAN losses in Appendix F.

Euler-Lagrange Analysis of GANs

3. Wasserstein GANs

The WGAN minimizes earth mover s distance (EMD) between the generator and the target data distributions, pg and pd, respectively. Earth mover s distance is a special case of the Wasserstein distance between two distributions. Through Kantorovich-Rubinstein duality, the WGAN optimization is speciﬁed via the min-max problem:

max D Ex pd[D(x)] Ex pg[D(x)] ,

which is equivalent to the sequential minimization:

D (x, pg) = arg min D: D L 1 LWGAN D , where LWGAN D = Ex pd[D(x)] + Ex pg[D(x)], and

p g(x) = arg min pg LWGAN G , where LWGAN G = Ex pd[D (x, pg)] Ex pg[D (x, pg)]

where in turn, D(x) L 1 denotes the Lipschitz constraint on the discriminator and D (x, pg) is the optimal discriminator for a given generator distribution pg. The optimal discriminator D is the one that penalizes regions of the input space where pg diﬀers from pd, while satisfying the Lipschitz constraint. The constraint is typically imposed by clipping the weights of the discriminator network. An alternative to weight-clipping is spectral normalization of the weights (Roth et al., 2019). Subsequent works (Gulrajani et al., 2017; Petzka et al., 2018; Terj ek, 2020) replaced the Lipschitz constraint with a gradient penalty to avoid exploding gradients in a neuralnetwork setting. For example, Gulrajani et al. (2017) replaced the Lipschitz-1 penalty with the gradient penalty (WGAN-GP): ( D(x) 2 1)2 =0. It is well-known that a function whose gradient has a bounded norm satisﬁes the Lipschitz constraint (Adler and Lunz, 2018). Table 1 lists a few important gradient-based regularizers proposed in the WGAN literature, which are considered in this paper. The original WGAN-GP empirically evaluated the discriminator gradient on samples drawn from the interpolated distribution αpg + (1 α)pd, 0 α 1, and penalizes values far away from 1 in the norm-squared sense. Petzka et al. (2018) incorporated a one-sided hinge-like penalty in the WGAN-LP formulation (LP stands for Lipschitz penalty). The gradient magnitude is upper-bounded by 1, by penalizing the discriminator only when the gradient magnitude exceeds 1. The gradients were evaluated empirically on an interpolated distribution as in the case of WGAN-GP. In the adversarial Lipschitz regularization proposed in WGAN-ALP (Terj ek, 2020), for a sample drawn from either the data or generator distributions, the regularizer was evaluated along the adversarial penalty direction radv the one along which the Lipschitz constraint is maximally violated. Mroueh et al. (2018) considered a gradient-norm penalty in the Sobolev GAN formulation, where they bounded the energy in the gradient of the discriminator, evaluated with respect to a base measure νp(x). From an implementation standpoint, they considered two base measures: (a) The midpoint distribution νp(x) = pd+pg

2 , which is a special case of the WGAN-GP penalty (Gulrajani et al., 2017); and (b) A noise-convolved version of pd, also considered in DRAGAN (Kodali et al., 2017). Mescheder et al. (2018) employed gradient penalties evaluated independently over real data (WGAN-Rd), over the generated data (WGAN-Rg), or a weighted combination of both (WGAN-Rd Rg) which can be seen as special cases of the Sobolev GAN penalty. Subsequent works extended the Wasserstein-1 distance

Asokan and Seelamantula

WGAN ﬂavor Discriminator loss

WGAN LWGAN D = Ex pd[D(x)] + Ex pg[D(x)]

WGAN-GP LWGAN D + λ Ex αpg+(1 α)pd ( D(x) 2 1)2 ; 0 α 1

WGAN-Rd Rg LWGAN D + λ1

2 Ex pd D(x) 2 2 + λ2

2 Ex pg D(x) 2 2

Sobolev GAN LWGAN D +λ Ex νp(x) D(x) 2 2 , where νp(x) 0; R

X νp(x)dx = 1

WGAN-LP LWGAN D + λ Ex αpg+(1 α)pd (max( D(x) 2 1, 0))2 ; 0 α 1

WGAN-ALP LWGAN D + λ Ex pd

max D(x) D(x+radv)

radv 2 1, 0 2 ,

where radv = max r: r 2>0

n D(x) D(x+r)

WGAN-GNP (Proposed)

LWGAN D + λd

D(x) 2 2 1 dx

Table 1: Discriminator loss functions corresponding to various WGAN variants considered in the literature alongside the proposed WGAN with gradient-norm penalty (WGANGNP). The key diﬀerence lies in how the Lipschitz penalty is enforced on the discriminator. While the vanilla WGAN clips the discriminator network weights, the other WGAN ﬂavors, including ours, consider gradient-based regularization.

based GAN to general Lp-norm spaces (Adler and Lunz, 2018) or propose solving the primal problem through diﬀerentiable Sinkhorn ﬁxed-point iterations (Genevay et al., 2018).

3.1 WGAN with Gradient-norm Penalty

Let X denote the convex hull that contains the supports of pd and pg. In this work, we consider the following gradient-norm penalty (GNP) for the WGAN:

D(x) 2 2 1 dx. (4)

In WGAN-GP, the gradients are evaluated over an interpolated distribution. As in the case of WGAN-Rd or WGAN-Rg, the proposed penalty can be viewed as a particular case of the penalty considered in the Sobolev GAN formulation. While WGAN-Rd and WGAN-Rg enforce the penalty on the supports of pd and pg, respectively, the proposed WGAN-GNP considers a uniform distribution on X, resulting in a closed-form solution to the discriminator, given the generator. In the WGAN-GNP setting, we constrain the generator and data distributions as follows:

Euler-Lagrange Analysis of GANs

Assumption 1 (C1(X) distributions). The generator and data distributions are compactly supported and continuously diﬀerentiable functions.

Incorporating ΩD into the WGAN discriminator cost results in the following optimization problem for the discriminator:

D (x, pd, pg) = arg min D

Ex pd[D(x)] + Ex pg[D(x)] + λd

D(x) 2 2 1 dx

The generator optimization is then given by

p g(x) = arg min pg {LG} , where

LG = Ex pd[D (x, pd, pg)] Ex pg[D (x, pd, pg)] + λp

X pg(x) dx 1

X µp(x)pg(x) dx, (6)

where λp and µp(x) are the Karush-Kunh-Tucker (KKT) multipliers for the integral constraint Ωpg : R

X pg(x)dx = 1, and the non-negativity constraint Φpg : pg(x) 0, respectively. These constraints explicitly enforce pg to be a valid p.d.f. We analyze WGAN-GNP in the one-dimensional setting ﬁrst and subsequently extend the analysis to higher dimensions.

3.2 WGAN-GNP and Euler-Lagrange Conditions in 1-D

In the 1-D setting, the gradient norm penalty in Equation (4) takes the following form:

|D (x)|2 1 dx,

where D denotes the ﬁrst derivative of D. The WGAN-GNP discriminator loss is given by

LD = Ex pd[D(x)] + Ex pg[D(x)] + λd

|D (x)|2 1 dx. (7)

The optimal discriminator in 1-D is given in the following result.

Theorem 1. Optimal WGAN-GNP discriminator (1-D): Consider the optimization of the one-dimensional WGAN-GNP discriminator loss given in Equation (7). The optimal discriminator D (x), given the generator pg, is a solution to one-dimensional Poisson s second-order diﬀerential equation:

D (x) = pg(x) pd(x)

2λd , x X, (8)

and is given by the closed-form solution involving the twice-iterated antiderivatives:

D (x) = 1 2λd

Z Z (pg(x) pd(x)) dx dx x, x X, (9)

where λd is the Lagrange multiplier corresponding to the gradient penalty, and R denotes the antiderivative.

Asokan and Seelamantula

Proof. The integrand in LD in Equation (7) is given by

F x, D, D = D(x)(pg(x) pd(x)) + λd |D (x)|2 1 .

The Euler-Lagrange condition in Equation (2), when applied to F results in the ordinary diﬀerential equation (ODE) given in Equation (8). The homogeneous solution to the 1-D Laplace equation D = 0 takes the form D h(x) = a1x + a0, while the particular solution D p(x, pg, pd) is twice-iterated antiderivative of the right-hand side of Equation (8). The optimal discriminator D (x, pg, pd) is the sum of the homogeneous and particular solutions:

D (x, pg, pd) = D p(x, pg, pd) + D h(x)

Z Z (pg(x) pd(x)) dx dx + a1x + a0.

The constants a0 and a1 must be estimated based on the boundary conditions. Upon convergence of the GAN, the optimal generator distribution must match the data distribution, that is, p g = pd. In this scenario, the optimal discriminator D opt(x) = D x, p g, pd , which implies that the particular solution D p = 0 and D opt(x) is the solution to Laplace s equation, that is, D opt(x) = D h(x). Enforcing the gradient-norm penalty: |(D opt) |2 = 1 yields a1 = 1. The choice of a0 remains free, as the gradient-norm penalty is satisﬁed for all a0 R, which merely oﬀsets D by a constant. Without loss of generality, we set a0 = 0. Thus, we obtain the optimal closed-form WGAN-GNP discriminator for a given generator, given in Equation (9).

While the homogeneous component can be estimated based on the form of ΩD, we will show (in Theorem 2) that the generator optimization is independent of Dh. Therefore, the choice of a1 and a0 are inconsequential to the GAN training. While D h(x) is independent of the distributions, we note that the particular solution D p (x, pg, pd) and thereby, the sum D (x, pg, pd) depend on the data being modelled, and the generator distribution from which the samples are provided. Henceforth, we use the notation D p(x) and D (x) for brevity, while bearing in mind that the optimal discriminator is always determined for a given pair of generator and data distributions. We now present an alternative, but equivalent, formulation of the optimal discriminator involving the fundamental solution of D (x) = δ(x), where δ(x) denotes the Dirac-delta distribution. The reason for providing the alternative solution is that it readily generalizes to higher dimensions.

Lemma 1. Fundamental WGAN-GNP discriminator: Given the 1-D Laplace equation in the distributional setting D (x) = δ(x) together with its fundamental solution φ(x) = r(x) + b1x + b0, where r(x) is the one-sided ramp function r(x) = x for x 0, and 0 elsewhere. The solution to the second-order ODE in Equation (8) is given by the convolution integral:

D (x) = 1 2λd (φ (pg pd)) (x) + a1x + a0. x X. (10)

Proof. The fundamental solution is obtained by taking the second-order antiderivative of δ(x) by interpreting the derivative in the distributional sense. The antiderivative of δ(x) is

Euler-Lagrange Analysis of GANs

the Heaviside unit-step function, and in turn, the antiderivative of the unit-step function is the one-sided ramp function r(x). This yields φ(x) = r(x) + b1x + b0. The symmetric fundamental solution φ(x) = 1

2|x| can be obtained as a special case by setting b1 = 0.5. The solution to Equation (8) can be obtained by convolving the right-hand side of (8) with the fundamental solution φ(x) (Evans, 2010). Including the homogeneous solution a1x + a0 results in Equation (10).

We would like to remark that the convolutional form of D (x), which solves Poisson s PDE, is a special case of the Riemann-Liouville integral (Stein, 1970) given by

Iα 1 [f](x) = cα 1

0 (x y)α 1f(y) dy,

where α is the order of the derivative of f, α [0, 1], and D p(x) = I2 1[pg pd](x).

Lemma 2. Optimal Lagrange multiplier λ d (1-D): The optimal Lagrange multiplier for the one-dimensional discriminator function given in Equation (10) is

X ((sgn (pg pd)) (x) + a1)2 dx , (11)

where sgn(x) = x |x| denotes the signum function, |X| is the length of the interval X in 1-D, and the positive root results in a D (x) that minimizes LD.

Proof. Enforcing the gradient-norm penalty on D (x) yields the optimal Lagrange multiplier λ d. The positive root is chosen based on the Legendre-Clebsch second-order condition (cf. Section 2), which yields 2λ d > 0 for D to be a minimizer of LD. The proof is provided in Appendix B.1.

Given the optimal discriminator D , recall the Lagrangian of the WGAN generator cost given in Equation (6):

LG = Ex pd[D (x)] Ex pg[D (x)] + Z

X (λp + µp(x)) pg(x) dx λp,

where λp and µp(x) are the KKT multipliers associated with the integral and non-negativity constraints, respectively. The following result presents the optimal WGAN-GNP generator given the optimal discriminator D (x).

Theorem 2. Optimal WGAN-GNP generator (1-D): Consider the optimization of the integral cost LG given by

X (D (x)(pd(x) pg(x)) + (λp + µp(x))pg(x)) dx λp,

where D (x) is given in Equation (10), and λp and µp(x) are the KKT multipliers satisfying < µp(x) 0, µp(x)pg(x) = 0, x X, and λp is a ﬁnite real value. Minimization of LG yields

p g(x) = pd(x), and µ p(x) = 0, x X,

and the solution is optimal for all ﬁnite real values of λp.

Asokan and Seelamantula

Proof. LG involves a convolution term inside the integrand. Hence, it is not in a form where the Euler-Lagrange conditions can be readily applied. Early manifestations of such cost functions was in the context of elastodynamics (Gurtin, 1964a) and initial-value problems (Gurtin, 1964b). In order to solve such problems, Gurtin introduced variational analysis of convolutional costs starting from ﬁrst principles and developed the corresponding counterpart of the fundamental lemma of calculus of variations. We adopt a similar approach by evaluating the ﬁrst variation of LG and then applying the fundamental lemma of calculus of variations (cf. Section 2). The result is the following equation:

(φ (pd p g))(x) = µp(x) + λp + a1 x

where φ is the fundamental solution to the Laplace equation. Considering the Laplacian on both sides gives p g(x) = pd(x) cµ p(x). Enforcing the integral and non-negativity constraints on p g(x) results in the desired optimum p g(x) = pd(x). The optimal solution p g is independent of the homogeneous component of the discriminator Dh(x). The detailed derivation is provided in Appendix B.2.

3.3 Constraint Space of the Discriminator (1-D)

The appropriate class of functions that the discriminator must be drawn from depends on the choice of the regularizer. For gradient-based regularizers, Sobolev spaces are most appropriate (Mroueh and Sercu, 2017; Mroueh et al., 2018). The ﬁrst-order L2-normed Sobolev Space W 1,2(X, ν) consists of ﬁnite-energy functions whose ﬁrst-order derivative is also of ﬁnite energy with respect to measure ν deﬁned over X. Consider the discriminator D W 1,2 with the Sobolev norm

D W 1,2 = q

D 2 2,ν + D 2 2,ν =

X |D(x)|2 dν + Z

X |D (x)|2 dν < .

In Sobolev GAN, Mroueh et al. (2018) consider the space W 1,2 0 , which consists of functions from W 1,2 that disappear on the boundary of a compact domain X. Consequently, invoking the Poincar e inequality, it suﬃces to consider functions with ﬁnite energy in the gradient (Sobolev, 1963) and the semi-norm D W 1,2 0 r D 2,ν, for some positive constant r.

The WGAN-GNP discriminator can therefore be interpreted as belonging to W 1,2 0 (X, UX ), where UX denotes the uniform measure over X.

3.4 Fourier-series Approximation

The closed-form optimal discriminator given by Equations (9) or (10) involves evaluating an integral. Further, in a practical GAN setting, we do not have access to pd and pg in closed form. Hence, in practice, one must resort to alternative approaches to solve the ODE in Equation (8). Typical alternatives include basis function expansions or discretization of the diﬀerential operators, which gives rise to a ﬁnite-diﬀerence equation. The existing GAN approaches, in particular, the WGAN approaches, employ a neural network for the discriminator and optimize it, agnostic to the underlying diﬀerential equation formulation of the discriminator. On the other hand, we prefer to solve the diﬀerential

Euler-Lagrange Analysis of GANs

equation, even if approximately, instead of employing a neural network. We use a Fourier bases expansion to solve the diﬀerential equation to serve as a proof of concept, although other basis expansions could be considered as well. Our approach is motivated by that of Fourier himself who introduced the series expansion to solve the heat equation in a metal (Fourier, 1807). The choice of Fourier bases is motivated by the fact that they are eigenfunctions of the Laplace operator, and the multidimensional Fourier bases can be expressed as a tensor product of the univariate counterparts, which simpliﬁes computation of the Fourier coeﬃcients. We refer to WGAN-GNP with the Fourier solver as WGAN-FS. From Assumption 1 and Section 3.3, the data and generator distributions, and the discriminator functions admit valid Fourier series expansions. Consider the Fourier-series expansions of the data density, generator density, and the discriminator, respectively

m Z αmejωomx, pg(x) = X

m Z βmejωomx, and D(x) = 1

m Z γmejωomx,

respectively. The Fourier-series model assumes periodicity and the fundamental frequency ωo has to be speciﬁed. Strictly speaking, although pd, pg, and D(x) are not periodic, we are concerned with these expansions only over a certain domain of interest. This is also considered as the fundamental period, which could be determined based on prior knowledge of the data being modelled. Substituting the Fourier expansions in Equation (8) gives

λd γm exp(jωomx) = βm αm

exp(jωomx), m = 0,

2ω2om2 , m = 0, (12)

which are the Fourier coeﬃcients corresponding to the discriminator, except γ0. While the value of γ0 can be determined based on the boundary conditions on D(x), we observed experimentally that it merely introduces a constant oﬀset in the optimal discriminator, and can therefore be ignored when training the generator. The Fourier-series approximation speciﬁes the discriminator by relating its Fourier coeﬃcients to those of pd and pg. The advantage of the Fourier expansion of a function is that the derivatives also admit an expansion in the same bases, so long as they have ﬁnite L2 norm. Clubbing the homogeneous solution a1 x + a0 with the particular solution obtained above gives the general solution. Without loss of generality, we set a0 = 0 and a1 = 1. The optimal WGAN-FS discriminator, given the generator (equivalently, its Fourier coeﬃcients), takes the form:

D FS(x) = x + 1 λ FS

m Z {0} γmejωomx,

where λ FS is the optimal Lagrange multiplier corresponding to the Fourier-series discriminator. The value of λ FS must be determined by enforcing additional conditions such as the gradient-norm penalty. This aspect will be discussed in Section 3.6.

Asokan and Seelamantula

3.5 Fourier Series of a Probability Density Function

Consider the Fourier-series expansion of a compactly supported density p(x) over [0, T]:

m Z amejωomx, x [0, T].

The Fourier coeﬃcient am is given by

0 p(x)e jωomx dx = 1 T E x p

e jωomx = 1 T ϕ p(ωom),

where ϕ p(ω) denotes the complex conjugate of the characteristic function ϕp(ω) = E[ejωx]. Eﬀectively, the coeﬃcient am is determined by uniformly sampling the characteristic function at ω = ωom, m Z. Similarly, αm = 1

T ϕ pd(ωom) and βm = 1

T ϕ pg(ωom), where ϕpd and ϕpg are the characteristic functions of pd and pg, respectively. For an inﬁnitely supported density, such as the Gaussian, one could consider a suﬃciently large interval (T > 10σ) about the mean for truncation. Then, the above expansion will hold only approximately, and in the limiting sense (T ), the series approximation approaches the Fourier transform. Thus, the discriminator can be computed in closed-form given the generator and data distributions. The diﬀerential equation formalism and the Fourier-series approximation obviate the need for training a neural-network-based discriminator.

3.6 Practical Considerations in 1-D

For distributions whose characteristic function can be evaluated in closed-form, the computation of the Fourier coeﬃcients and thereby the discriminator is straightforward. In practice, the inﬁnite-order summations are truncated. Further, the characteristic function is not often available in closed form and instead, only data is given. In this scenario, we replace the Fourier coeﬃcients am = 1

T ϕ p(ωom) with their sample estimates as follows:

xk D e jωonxk,

D = {xk | k = 1, 2, . . . , N} being the given data that follows the distribution p(x). The convergence properties of empirical characteristic functions have been extensively explored in the literature (Giardina and Chirlian, 1972; Feuerverger and Mureika, 1977). A 1D function p(x) C1(X) supported over X is of bounded variation VX [p(x)] = R

X |p (x)| dx B. The mean-squared error ϵ2 p in truncating the series to M terms is bounded above by B2 πωo M (Giardina and Chirlian, 1972). For D FS(x), the mean-squared error is given by ϵ2 D c ω3o M3 , for some positive constant c. The proof is included in Appendix D.1. We observe that the error in approximating D decays as 1 M3 . Therefore, we expect that the truncated series will yield accurate approximations even for moderate M. Experimental results in Appendix E.1 support this claim. Further, from an implementation perspective, as in the case of Tensor Flow (Abadi et al., 2016), one could avoid complex arithmetic by using a trigonometric Fourier-series expansion

Euler-Lagrange Analysis of GANs

of the type

m=1 ar m cos(ωomx) + ai m sin(ωomx),

where ar m = Real{ am} and ai m = Imag{ am}. Considering truncated trigonometric Fourierseries expansions of pg and pd, and accounting for the homogeneous solution, we obtain

D FS(x) = x + 1 λ FS

m=1 γr m cos(ωomx) + γi m sin(ωomx).

It has been reported in the literature that employing the ideal discriminator for a given generator prevents stable training of the generator due to vanishing gradients (Liu et al., 2017), while a very coarse approximation might not capture the modes present in the data distribution (Daskalakis et al., 2018). Hence, a smooth approximation of the ideal discriminator is considered a good compromise this is precisely what the truncated Fourier-series approximation implicitly also achieves. Enforcing the gradient-norm penalty on D FS(x) enables one to determine the optimal λ FS associated with the Fourier-series discriminator. In terms of the data D, λ FS can be approximated as follows:

v u u u t(2M + 1)

m=1 (τ im + τ rm) + 1

m=1 (τ im τ rm) cos(2ωomxk)

where τ r m = 1

2(γr mωom)2, and τ i m = 1

2(γi mωom)2. The derivation is provided in Appendix B.3. Similar to the Fourier coeﬃcients of D FS(x), we observe that τ r m and τ i m are functions of γr m and γi m, respectively, which in turn depend on the Fourier coeﬃcients of pd and pg. As the generator distribution converges toward the optimal solution p g = pd, λ d in Equation (11) as well as λ FS converge to zero. Therefore, monitoring λ FS in a WGAN-FS training scenario serves as a practical alternative to computing the divergence between pg and pd. As shown in Section 6.2 and Appendix E.1, this alternative is particularly useful when dealing with real-world images with no closed-form representation for the underlying distribution.

3.7 Illustration Using Synthetic 1-D Data

As a preliminary validation, we compare the performance of the proposed WGAN-FS on the 1-D Gaussian learning task.

Baselines: We compare WGAN-FS with the following two categories of baselines: (i) WGAN and its variants with diﬀerent penalties, such as the gradient penalty (WGANGP), Lipschitz penalty (WGAN-LP), Sobolev GAN and stable alternatives to GP, such as WGAN-Rd and WGAN-Rg; and (ii) base WGAN with variations of the proposed gradientnorm penalty (GNP), evaluated empirically on sample points drawn from the two datasets. WGAN-GNP implements the WGAN-GP algorithm with the GNP cost. While we compute the optimal Lagrange multiplier λd in closed-form in WGAN-FS, in Sobolev GANs, λd is

Asokan and Seelamantula

optimized to maximize the discriminator loss through stochastic gradient-descent (Mroueh et al., 2018). Recently, multi-layer networks with periodic sinusoidal activations (SIREN) have been shown to achieve state-of-the-art performance in learning image, sound and waveﬁeld representations (Sitzmann et al., 2020). We therefore adopt two variants of SIREN for the discriminator: (a) A three-layer fully connected network with sin activation, called WGAN-GNP (3S); and (b) A single-layer fully connected network with sin activation and the same number of nodes as terms in the Fourier-series expansion, called WGAN-GNP (1S). Training WGAN-GNP (1S) is equivalent to learning the Fourier coeﬃcients in the WGAN-FS formulation.

Experimental setup: The generator in all GAN variants is considered to be a linear transformation of the input: y = w z + b. Gaussian training data is drawn from N(10, 1), while noise z that is input to the generator is sampled from the standard Gaussian N(0, 1). While WGAN-FS uses a closed-form Fourier-series discriminator, the baselines use a threelayer fully connected discriminator network with leaky Re LU activation. The batch size is 500. For the baseline techniques, each training step involves 5 iterations of the discriminator network optimization followed by one iteration of the generator. WGAN-FS, on the other hand, uses a single-shot discriminator during each training step. Based on additional experiments conducted in Appendix E.1, we set the period T = 2π

ωo to 15 and the truncation frequency M to 10. The Adam optimizer (Kingma and Ba, 2015) is used with a learning rate η = 0.05, and the exponential decay parameters for the ﬁrst and second moments are β1 = 0.5 and β2 = 0.999, respectively. The implementation was carried out using Tensor Flow 2.0 (Abadi et al., 2016).

Results: Figures 1(a) and (b) show the discriminators learnt by the various GANs under consideration. The optimal classiﬁer between the two Gaussians is also plotted for the sake of reference. All classiﬁer outputs are rescaled to [ 0.5, 0.5] to facilitate comparison. While the discriminator peak values depend on the network architecture for the baselines, it is a function of the period T and the constant a0 in WGAN-FS. We observe that the discriminator in WGAN-FS is closer to the optimal classiﬁer than the ones learnt by the baseline WGAN variants. In the case of GNP variants, WGAN-GNP is comparable to the best case baseline WGAN-LP. The WGAN-GNP (3S) model is able to localize the target distribution, but the mismatch between the periodicity of the data and the activation function results in undesirable harmonics in D(x). On the other hand, WGAN-GNP (1S), due to its Fourier-series structure, comes closest to the WGAN-FS discriminator. This experiment shows that solving the diﬀerential equation single-shot, even if approximately, is a more accurate alternative to training a network for the discriminator.

Figures 1(c) and (d) compare the Wasserstein-2 distance (W2,2) between the target and generator Gaussians. These plots show that WGAN-FS converges much faster than the baseline techniques. This is a direct consequence of solving the diﬀerential equation within the training process. The poor performance of WGAN-GNP (3S) and WGAN-GNP (1S) is attributed to the mismatch between the fundamental frequency of the sinusoid and the assumed periodicity. Figures 1(d) and (e) compare the Kullback-Leibler (KL) divergence between the generator and true data distributions. We observe similar performance improvements in WGAN-FS compared with the baselines, as in the W2,2 case.

Euler-Lagrange Analysis of GANs

IDEAL CLASSIFIER

Sobolev GAN

WGAN-GNP(3S)

WGAN-GNP(1S)

DISCRIMINATOR D(x)

DISCRIMINATOR D(x)

500 1000 ITERATIONS

W2,2(pd, pg)

500 1000 ITERATIONS

W2,2(pd, pg)

250 500 750 1000 ITERATIONS

KL DIVERGENCE

250 500 750 1000 ITERATIONS

KL DIVERGENCE

Figure 1: (

Color online) Experiments on 1-D Gaussian data: (a) & (b) Discriminator learnt by WGAN-FS in comparison with those learnt by the baselines and WGANGNP variants. Baseline implementation with empirical gradient estimates learn piecewise linear discriminators, while WGAN-FS and variants of WGAN-GNP with sinusoidal activations learn a smoother function. The WGAN-FS discriminator is closest to the ideal classiﬁer, while being smooth. Wasserstein-2 distance between pd and pg (W2,2(pd, pg)) versus iterations for (c) WGAN-FS and the baseline WGANs and (d) WGAN-FS and WGAN-GNP variants. Kullback-Leibler divergence versus iterations for (e) WGAN-FS and the baseline WGANs and (f) WGAN-FS and WGAN-GNP variants. WGAN-FS attained the lowest (best) values in terms of both metrics substantially faster than the baselines.

Asokan and Seelamantula

4. Multivariate WGAN-GNP

In the n-dimensional WGAN-GNP scenario, the discriminator loss takes the form:

LD = Ex pd[D(x)] + Ex pg[D(x)] + λd

D(x) 2 2 1 dx

D(x) (pg(x) pd(x)) + λd D(x) 2 2 1 dx. (13)

We now determine the optimal discriminator corresponding to the loss given in Equation (13).

Theorem 3. Optimal WGAN-GNP discriminator in n-D: Consider the n-D WGAN discriminator loss subject to the gradient-norm penalty as given by Equation (13). The optimizer of LD solves Poisson s partial diﬀerential equation given by

D (x) = pd(x) pg(x)

where = . = 2 x1 + 2 x2 + . . . + 2 xn denotes the Laplacian operator, with xi being the ith entry of x, and 2 xi = 2 x2 i . The closed-form particular solution is given by the multidimensional convolution integral

D p(x) = 1 2λd

X φ(x y) (pd(y) pg(y)) dy, (15)

where φ(x) denotes the fundamental solution to the Laplace equation: D(x) = δ(x). The fundamental solution is given by

2π ln( x ), for n = 2, and

1 n(n 2)v(n) 1 x n 2 , for n 3, (16)

where x = p

x2 1 + x2 2 + . . . + x2n and v(n) is the volume of the unit sphere in Rn given by v(n) = π n 2 Γ n

2 + 1 1, with Γ(n) denoting the gamma function.

Proof. Consider the integrand in Equation (13): D(x)(pg(x) pd(x)) + λd D(x) 2 2. Applying the Euler-Lagrange condition from Equation (3) for obtaining the optimum results in Poisson s partial diﬀerential equation (PDE) given in Equation (14). A closed-form solution to Poisson s equation is obtained similar to the 1-D case. Solving the n-D inhomogeneous diﬀerential equation D(x) = δ(x) in polar coordinates yields the fundamental solution φ(x) given in Equation (16) (Evans, 2010). The solution to Poisson s equation D(x) = pd(x) pg(x)

2λd is the convolution between φ(x) and pd(x) pg(x)

2λd , which results in Equation (15).

For the speciﬁc case n = 2, we obtain

D p(x) = 1 4πλd

X ln( x y ) (pd(y) pg(y)) dy. (17)

Euler-Lagrange Analysis of GANs

For n 3, we obtain

D p(x) = 1 2λdn(n 2)v(n)

1 x y n 2 (pd(y) pg(y)) dy. (18)

Including the family of homogeneous solutions D h(x) = a, x + constant, the general solution becomes

D (x) = D p(x) + a, x + constant. (19)

As in the 1-D case, upon convergence of the GAN, we expect p g(x) = pd(x). The optimal discriminator in this scenario is given by D opt(x) = D h(x). Enforcing the gradient-norm penalty, we obtain the condition a = 1, with the constant term merely resulting in an oﬀset of D (x). Equations (17) and (18) are speciﬁc instances of the Calderon-Zygmund singular integral (Stein, 1970), with kernels KD,2(x, y) = ln( x y ) and KD,n(x, y) = x y 2 n, respectively, with singularities along x = y. The integrals are evaluated in the Cauchy principal-value sense. D p(x) for n 3 is also a speciﬁc instance of the Riesz potential (Riesz, 1949; Landkof, 1972) given by

Iα n[f](x) = cα n

X Rn x y α nf(y) dy, 0 < α < n,

which, in turn, is an n-dimensional generalization of the Riemann-Liouville integral (Stein, 1970). In the present context, D p(x) = I2 n[pg pd](x). In the language of electrostatics, D p(x) could be interpreted as the diﬀerence between the Newtonian potentials of the functions pg and pd. The charge-free space scenario corresponds to p g = pd, which results in D p(x) = 0. A similar elliptic diﬀerential equation, deﬁned via the Stein operator (Oates et al., 2017), was encountered in the context of Sobolev GANs derived using the Sobolev integral probability metric (Mroueh et al., 2018). Our choice of a uniform prior in ΩD results in the Laplacian operator, and subsequently, Poisson s PDE, which is relatively easier to solve than Stein s operator based elliptic PDE. The optimal Lagrange multiplier λ d associated with optimal WGAN-GNP discriminator D (x) is presented next.

Lemma 3. Optimal Lagrange multiplier λ d (n-D) : Consider the n-dimensional discriminator function given by Equation (19). The associated optimal Lagrange multiplier is given by

Kλ n,i (pg pd) (x) + ai 2 dx,

where |X| denotes the volume of the support X, and Kλ n,i is a singular convolutional kernel given by

Kλ n,i(x) = KD,n(x)

, for n = 2, and

, for n 3. (20)

Asokan and Seelamantula

The positive square-root of λ d results in a D (x) that minimizes the cost function LD given in Equation (13).

Proof. As in the 1-D case, the optimal Lagrange multiplier λ d can be found by enforcing the gradient-norm penalty ΩD on D (x). The choice of the sign of the square-root is given by the second-order (Legendre-Clebsch) condition for a minimizer: 2λ d In 0, where In is the n-dimensional identity matrix and indicates positive-deﬁniteness. The positive root minimizes the cost, whereas the negative root maximizes it. A detailed proof is presented in Appendix C.1.

The kernel Kλ n,i(x) in Equation (20) is closely related to the Riesz kernels given by KRj(x, y) = xj yj x y n+1 . More precisely, Kλ n,i(x, y) = KRi(x, y) x y . Given the optimal discriminator and the Lagrange multiplier, consider the optimization of the generator cost. Similar to the 1-D case, the Lagrangian LG of the WGAN-GNP cost in Rn is given as follows:

LG = LWGAN G + λp

X pg(x) dx 1 + Z

X µp(x)pg(x) dx,

where λp and µp(x) are the KKT multipliers. The following result gives the optimal generator distribution.

Theorem 4. Optimal WGAN-GNP generator (n-D): Consider the n-dimensional generator loss LG subject to the integral constraint Ωpg and non-negativity constraint Φpg, given by

X (D (x) (pd(x) pg(x)) + (λp + µp(x)) pg(x)) dx λp,

where D (x) is given by Equation (19), and the KKT multipliers satisfy < µp(x) 0, µp(x)pg(x) = 0, and λp is a ﬁnite real value. Then, the optimal solution set is

p g(x) = pd(x), and µ p(x) = 0, x X,

and the solution is optimal for all ﬁnite real values of λp.

Proof. The proof is similar to the 1-D case and is given in Appendix C.2.

4.1 Constraint Space of the Discriminator (n-D)

Consider the multivariate setting of the ﬁrst-order L2-normed Sobolev Space W 1,2(X, ν), with the norm given by

D W 1,2 = q

D 2 2,ν + D 2 2,ν =

X D(x) 2 dν + Z

X D(x) 2 dν < .

As in the 1-D case, since we do not explicitly enforce a bound on the energy of the discriminator, by virtue of the Poincar e inequality D W 1,2 0 r D 2,ν, the WGAN-GNP

discriminator can be seen as coming from the semi-normed space W 1,2 0 (X, UX ), where UX denotes the uniform measure over X.

Euler-Lagrange Analysis of GANs

4.2 Multi-dimensional Fourier-series Solution

As in the 1-D case, we solve the discriminator PDE in Equation (14) using a Fourier-series expansion, but this time considering the multivariate counterparts:

m Zn αmej ωm,x , pg(x)= X

m Zn βmej ωm,x , and DFS(x)= 1

m Zn γmej ωm,x ,

with frequency harmonics ωm = [m1ω1, m2ω2, . . . , mnωn]T. Substituting the Fourier-series expansions in (14) and comparing terms on both sides gives

, m Zn {0}. (22)

The value of γ0 introduces a DC oﬀset in DFS(x), and without loss of generality, we set γ0 = 0. Similar to the 1-D case, we have αm = 1

T n ϕ pd(ωm) and βm = 1

T n ϕ pg(ωm), where ϕ represents the complex conjugate of the characteristic function of the corresponding distribution. We now present results on applying the WGAN-FS discriminator for 2-D Gaussian and Gaussian mixture learning problems. As in the 1-D case, we truncate the Fourier series to M terms along each dimension, and replace the complex Fourier series with its trigonometric counterpart.

4.3 Illustration Using Synthetic 2-D Data

Experimental setup: We conduct experiments on 2-D Gaussian and 8-component Gaussian mixture models (GMM). We draw Gaussian data from N(0.7512, 0.1I2), where 12 denotes a 2-D vector with both entries equal to 1, and I2 denotes the 2 2 identity matrix. The noise that is input to the generator is drawn from a Gaussian N(02, I2). The choice of baselines, training parameters (learning rate, batch size, Fourier-series parameters, and the Adam optimizer decay rates) and the architectures of the generator and the discriminator are identical to the 1-D scenario (cf. Section 3.7). In the 8-component GMM experiment, isotropic Gaussians are considered with standard deviation 0.05 and means lying in [0, 1] [0, 1]. The noise that is input to the generator is drawn from N(0100, I100). The generator architecture for all WGAN models under consideration consists of three fully connected layers of 128, 64, and 32 nodes with Leaky Re LU activation in each layer. The output layer has two nodes and a sigmoid activation. The discriminator network for the baseline models is a three-layer fully connected network as in the 1-D case. The training parameters are the same as in the 1-D case.

Results: Figures 2(a) and (b) show the Wasserstein-2 distance W2,2(pd, pg) between the generator and true data distributions as a function of the iterations for the WGAN and WGANGNP ﬂavors under consideration, respectively, for 2-D Gaussian data. The Wasserstein-2 distance decays much faster in the case of WGAN-FS compared with the baseline variants. As in the 1-D case, we observe that replacing the baseline gradient penalty with that of WGAN-GNP results in a performance on par with the best-case baseline. Similarly, training a single-layer discriminator with a sinusoidal activation function to approximately learn

Asokan and Seelamantula

100 200 300 ITERATIONS

W2,2(pd, pg)

WGAN WGAN-GP WGAN-LP WGAN-ALP WGAN-Rd WGAN-Rg Sobolev GAN WGAN-FS

100 200 ITERATIONS

W2,2(pd, pg)

WGAN-Rg WGAN-GNP WGAN-GNP(3S)

WGAN-GNP(1S) WGAN-FS

100 200 300 ITERATIONS

W2,2(pd, pg)

lr = 5 10 2

(a) (b) (c)

Figure 2: (

Color online) Experiments on 2-D Gaussian data: (a) & (b) Wasserstein-2 distance (W2,2(pd, pg)) between WGAN-FS and (a) baseline WGAN variants, (b) trainable variants of the proposed WGAN-GNP. The closed-form Fourier-series approach to enforcing the gradient-norm penalty converges an order faster than the baselines and trainable variants of the same loss. (c) Wasserstein-2 distance (W2,2(pd, pg)) for WGAN-FS trained with diﬀerent learning rates for the generator. WGAN-FS is robust to changes in the learning rate, and converges stably in terms of W2,2 for learning rates lr lower than 10 1.

2000 4000 6000 ITERATIONS

W2,2(pd, pg)

2000 4000 6000 ITERATIONS

KL DIVERGENCE

5000 10000 15000 ITERATIONS

W2,2(pd, pg)

WGAN-FS (D p)

WGAN-FS (D p + Dh)

(a) (b) (c)

WGAN WGAN-GP

WGAN-LP WGAN-ALP

WGAN-Rd WGAN-Rg

Sobolev GAN WGAN-FS

Figure 3: (

Color online) Experiments on 2-D Gaussian-mixture data: Comparison of (a) Wasserstein-2 distance (W2,2(pd, pg)), and (b) Kullback-Leibler divergence between the data and generator distributions for WGAN-FS and baseline WGANs. WGAN-FS converges to a lower (better) value than the baselines in terms of both metrics. (c) Comparison of W2,2(pd, pg) versus iterations for WGAN-FS with and without the homogeneous solution Dh(x). The convergence of the WGAN-FS generator is relatively unaﬀected by the homogeneous component.

Euler-Lagrange Analysis of GANs

Figure 4: (

Color online) Convergence of generator distribution (green) to the target multimodal Gaussian data (red) on the considered WGAN variants. The heat map represents the values taken by discriminator. The ideal D(x) is the one that takes larger values at locations where pd > pg and vice versa, converging to a constant after p g approaches pd. The Fourier-series approximation of WGAN-FS approach leads to a better representation of the discriminator during the initial iterations than the baselines, leading to faster convergence. 1K = 1000.

Asokan and Seelamantula

the Fourier coeﬃcients results in poorer performance compared with WGAN-FS, as the suboptimal coeﬃcients cannot represent the distributions pd or pg accurately. Figure 2(c) shows W2,2(pd, pg) for WGAN-FS as the iterations progress considering several learning rates in the generator network. The plot indicates that learning rates in the range of 10 2

to 10 3 are optimal for smooth convergence. For lower learning rates, the convergence is not smooth as evidenced by the noise in W2,2. Learning rates larger than 0.1 resulted in the generator weights blowing up. Figures 3(a) and (b) depict the W2,2 metric and KL divergence, respectively, as a function of iterations for the WGAN baseline models and the proposed WGAN-FS on the GMM learning task. The KL divergence is estimated parametrically by binning batches of samples to form histograms. The Wasserstein-2 distance is computed as a sample estimate using the publicly released Python optimal transport library (Flamary et al., 2021). We observe that, for the given choice of parameters, the baseline WGAN and WGAN-GP models latched on to diﬀerent modes of the GMM at diﬀerent stages of the optimization, failing to capture the entire distribution. We observe that WGAN-FS converges to lower values of the metrics compared with the baselines. Figure 4 shows the convergence of the generator distribution to the target data distribution in each case, while the associated heat-map represents the level-set of D (x) at the given iteration. We observe that, during the initial iterations of training, WGAN-FS learns a signiﬁcantly better representation of the underlying distributions compared with the baselines. This is evident from the fact that, while the baselines require optimizing a neural network for the discriminator, WGAN-FS provides the optimal discriminator for a given generator in closed form/single-shot at each iteration. Figure 3(c) compares the diﬀerence in performance of WGAN-FS with and without the homogeneous solution included. The generator optimization is independent of the homogeneous solution, with nearly identical performance in both cases, which is in accordance with the theoretical results.

5. Implementing the Fourier Series in Higher Dimensions

In higher dimensions, there is a combinatorial explosion in the number of terms given data in Rn, a Fourier-series expansion comprising M harmonics would have n M terms. To get a feel for the kind of computational challenge that we are faced with, consider the MNIST dataset (Le Cun et al., 1998) with data in R784. Even if one were to consider a truncated Fourier-series approximation with a mere 50 terms along each dimension, the total number of Fourier coeﬃcients would be 78450, which is of the order of 10144. To gauge how big this number is, consider the following fact: According to an estimate, there are 1080 atoms in the known observable universe (Fermi-LAT Collaboration, 2018). In this section, we derive bounds on the truncation error and discuss implementation related issues. Experiments on synthetic Gaussian and real-world image data validating these results are provided in Appendix E

5.1 Fourier Series in n-D

Given the Fourier-series expansions as in Equation (21), consider the case where the fundamental frequency ωo is the same along all dimensions. Further, consider the following assumptions on pd and pg:

Euler-Lagrange Analysis of GANs

Assumption 2 (Generator and data distributions). The generator and data distributions are compactly supported, ℓ-times continuously diﬀerentiable functions (pg, pd Cℓ(Rn)), with bounded energy in their gradients up to, and including order k (pg, pd W k,2(X)), and vanish on the boundary of X, i.e., we have pg, pd Cℓ(X) T W k,2(X), where ℓ> k.

It is known that such functions have rapidly decaying Fourier coeﬃcients (Sobolev, 1963; Feﬀerman, 1973). Similar assumptions on the generator and data distributions were required when deriving the convergence rate of the training algorithms for Sobolev GANs (Liang, 2021). We now derive the bound on the mean-squared error incurred while truncating the Fourier series of the discriminator and p.d.f.s with the square partial sum:

m [M]n αmejωo m,x , pg(x) = X

m [M]n βmejωo m,x , and

m [M]n γmejωo m,x , (23)

where [M]n denotes the Cartesian product space { M, M + 1, . . . , M 1, M}n.

Theorem 5. Bounds on the truncation error for the discriminator: Consider the generator and data distributions coming from Cℓ(X) T W k,2(X), ℓ> k, for ﬁnite k, and the inﬁnite and truncated Fourier-series expansions deﬁned in Equations (21) and (23), respectively, where the coeﬃcients γm are given by Equation (22). The mean-squared error in truncation can be bounded as follows:

ϵ2 D = DFS(x) DFS(x) 2 2 Cn,T

(M2n) (k 2)

where Cn,T is a positive constant that depends only on the dimensionality of the data (n), and the period (T).

The proof is provided in Appendix D.2. While the bound given in Theorem 5 is valid for ﬁnite k, given the truncation order M and the dimensionality of the data n, smoother functions (larger k) result in tighter bounds. While the ambient dimension of images is large (for example, MNIST in R784 or Celeb A in R106), thanks to the Manifold Hypothesis (Kelley, 2017), it is reasonable to assume that images reside in lower-dimensional manifolds, or unions thereof (Lui et al., 2017; Khayatkhoei et al., 2018; Lei et al., 2019). We therefore propose to perform the Fourier-series approximation in learning latent representations of the data, where the bound in Equation (24) is more likely to be useful. This is eﬀectively latent-space matching akin to that considered in Wasserstein autoencoders (WAEs) (Tolstikhin et al., 2018). The generator samples are low-dimensional representations of images learnt by an autoencoder, and the target distribution is a truncated Gaussian. We present comparisons on learning multivariate Gaussians using WGAN-FS in Appendix E.2, and learning image-space distributions with WGAN-FS in Appendix E.3. While truncating the Fourier series is one-side of the approximation, the other is that of estimating the coeﬃcients. Consider the Fourier-series representation of the data distribution

Asokan and Seelamantula

pd, where αr m and αr m are the true Fourier coeﬃcient and its N-sample estimate, given by

X cos(ωo m, x )pd(x) dx and αr m = 1

cos(ωo m, xk ), (25)

respectively. Then, the expected error in approximating pd through the sample estimate is given by the following theorem:

Theorem 6. Bound on the Fourier series approximation error for the data distribution: Let the Fourier-series representation of the data distribution pd be as given in Equation (21). Consider the true and N-sample estimates of the Fourier coeﬃcients given in Equation (25). For ﬁnite k, the mean-squared error in approximating the truncated Fourier-series expansion of pd can be bounded as follows:

Ex ϵ2 pd Mn

1 mpd nk+ n+1

| {z } εstat

+ Mpd C n,k

| {z } εtrunc

where mpd < Mpd are two positive constants, C n,k is a positive constant whose value depends on the dimensionality of the data n and the Sobolev order k, and εstat and εtrunc represent the statistical and deterministic components of the error, respectively.

The proof is provided in Appendix D.3. A similar bound can also be derived for the generator distribution pg. The key takeaway from Theorem 6 is that there exists a trade-oﬀ between minimizing the truncation error εtrunc, and the statistical error εstat. For the approximation error εstat to decay, given M and n, the batch size must increase at least at a rate of N Mn+1. This results in a trade-oﬀbetween discarding high-frequency components versus inaccurately estimating them due to ﬁnite sample size, which result in Gibbs oscillations. Experiments illustrating this phenomenon on 1-D data are presented in Appendix E.1. We discuss the choice of the truncation length M in Section 5.2.

5.2 Practical Considerations in Truncated Fourier-series

Motivated by the result in Theorems 5 and 6, we make certain reasonable and simplifying assumptions on the Fourier-series expansion to circumvent the computational barrier. In order to reduce the number of terms in the summation, we consider the fundamental frequency ωo to be the same along all dimensions. We consider two truncation frequencies, Mlow and Mhigh. Since images have a signiﬁcant low-frequency content, we deterministically include all low-frequency components along each dimension to Mlow. To improve the representation of high-frequency components, we perform uniform random sampling in the coeﬃcient space between Mlow and Mhigh. We consider a tiny subset of harmonics from the set of n M harmonic frequencies. We pick L frequencies uniformly at random from U[Mlow, Mhigh]. The matrix of sampled frequencies is given as follows:

j 1 . . . 1 1 j . . . 1 ... ... ... ... 1 1 . . . j

Euler-Lagrange Analysis of GANs

where U is an n L matrix whose elements are drawn from (Mlow, Mhigh]. These simpliﬁcations together with a trigonometric Fourier expansion as in the 1-D case give rise to the following sampled Fourier-series approximation of the optimal discriminator:

D FS(x) 1 λ FS

m M γr m cos(ωo m, x ) + X

m M γi m sin(ωo m, x ) , (28)

where M is a set comprising the columns of M. Additionally, as in the 1-D case, the Fourier coeﬃcients of pd and pg are obtained using their sample estimates computed over batches of size N:

cos(ωo m, xk ), αi m 1 NT

sin(ωo m, xk ),

cos(ωo m, xk ), and βi m 1 NT

sin(ωo m, xk ).

Enforcing the gradient-norm penalty ΩD on (28) results in λ FS in n-D. The worst-case value of λ FS satisﬁes:

v u u t(2|M| + 1)

m M (τ im + τ rm) + 1

m M (τ im τ rm) cos(2ωo m, xk )

2(γr m)2ω2 o m 2, and τ i m = 1

2(γi m)2ω2 o m 2,

and the samples xk are drawn from the uniform mixture of pd and pg. The derivation is included in Appendix C.3. The quality of the sample estimates, measured in terms of the variance and mean-squared approximation error are presented in Appendix D.3. The trade-oﬀbetween the truncation error (caused by discarding harmonics above a truncation order M), and the approximation error (caused by estimating the Fourier coeﬃcients with N-sample averages), suggest that including low-frequency components improves the overall quality of estimation. Based on multiple experiments on synthetic learning tasks, we set Mlow = 2, Mhigh = 10, and L = 103 (cf. Appendix E.2).

6. Wasserstein Adversarial Autoencoder

We extend the Fourier-series based WGAN to high-dimensional latent space matching based on Wasserstein autoencoders (WAEs) (Tolstikhin et al., 2018) and adversarial autoencoders (AAE) (Makhzani et al., 2015). In WAE, a vanilla autoencoder s (Hinton and Zemel, 1994; Schmidhuber, 2014) latent space representation is required to conform to a given prior distribution, usually a Gaussian or a mixture of Gaussians, through an auxiliary network that minimizes the distance between the two distributions. The encoder-decoder pair is

Asokan and Seelamantula

trained by minimizing an appropriate distance measured between the input data distribution and that of the reconstructed samples. In the WAE-GAN setting, the encoder of the WAE plays the role of the GAN generator, which is trained using a discriminator network to force the latent space distribution to match the prior using a suitable GAN loss. Considering the Euclidean distance metric between a data sample x and the corresponding reconstruction x, and the SGAN loss, we obtain the AAE formulation. The vanilla WAE-GAN formulation employs the Euclidean loss for the reconstruction in combination with the KL divergence for the GAN loss. Sliced WAE (Kolouri et al., 2019) extended the framework to accommodate the sliced Wasserstein loss (Deshpande et al., 2018). As an alternative to the adversarial formulation, maximum mean discrepancy (MMD) based variations of the WAE have also gained popularity. The most recent of which, the Cram er-Wold autoencoder (CWAE) (Knop et al., 2020) presents a characteristic kernel that allows for closed-form computation of the distance between the latent distribution and a standard Gaussian. Optimal transport based approaches either approximate the semi-discrete latent-space transportation map with the continuous Bernier potential by drawing links between the latent-space matching and the Monge-Amp ere equation (Lei et al., 2019; An et al., 2020) or determine the Kantorovich potential in a two-step manner to learn a mapping from the source/prior distribution to the target using a WGAN-GP discriminator (Liu et al., 2018). Adding regularizers to the discriminator cost in AAEs has shown to improve the interpolation capabilities for the autoencoder component (Berthelot et al., 2019). We introduce WAEFR, which is the Fourier-series representation of the discriminator integrated within the WAE-GAN framework. The block diagram is presented in Figure 5. Within WAEFR, the roles of the target and generator distributions are swapped, compared to WGAN-FS. In WAEFR, the target is the standard normal distribution N(0, I), while the latent space distribution is optimized to match the target. In WAEFR, we use the mean-squared error for training the encoder-decoder pair:

LAE(x, x) = x x 2 2,

where x = Decoder(Encoder(x)) is the reconstruction of x. The encoder-discriminator pair is trained using the WGAN-FS algorithm described in Section 4.2. We ﬁrst solve for the Fourier coeﬃcients βm and αm corresponding to the latent space distribution, pdℓ= Encoder(pd), and the latent space prior pℓ, respectively, which gives us the closed-form discriminator. Subsequently, the encoder is trained with the WGAN loss:

LG = E z pdℓ[D( z)] Ez pℓ[D(z)].

The training procedure for WAEFR is summarized in Algorithm 1.

6.1 Fourier-series Discriminator

To implement D(x) such that the gradients ﬂow to train the generator, the Tensor Flow data handling pipeline was used gainfully by representing Equation (28) as a static two-layer network with an intermediate Fourier-series solver that computes the network weights in each iteration. Figure 6 depicts the network architecture. The operation ωo m, x , with the associated weight matrix ωo M as given in Equation (27), constitutes the ﬁrst layer and is followed by cosine/sine activations. The output of the ﬁrst layer is used over batches of data

Euler-Lagrange Analysis of GANs

Figure 5: (

Color online) The training and testing pipelines of WAEFR, the Wasserstein autoencoder with the Fourier-series representation of the discriminator. The discriminator function is evaluated in closed form based on the Fourier coeﬃcients determined from the latent space distribution and the desired prior.

Figure 6: (

Color online) Fourier-series based discriminator architecture. The latent representations are in Rn, the data is input to the network in batches of size N, and the Fourier-series summation is truncated to L terms.

of size N to estimate the Fourier-series coeﬃcients, from which the discriminator weights γr m and γi m, and the parameters τ r m and τ i m that determine the Lagrange multiplier are evaluated (Section 5.1). Dense network connections with these weights, together with the evaluation of the homogeneous component Dh and Lagrange multiplier λ FS, form the second layer of the discriminator network. While the ﬁrst layer is ﬁxed throughout the training of the GAN, the weights and biases of the second layer are evaluated for each batch of data.

Asokan and Seelamantula

Algorithm 1 WAEFR Training the Wasserstein Autoencoder with a Fourier-series discriminator.

Inputs: Training data x pd, prior distribution N(µz, Σz), batch size N, learning rate η, number of GAN pre-training epochs n GAN Models: Encoder/Generator Encθ; Decoder Decψ; Fourier-series discriminator D FS. GAN pre-training: for n GAN iterations do

Sample: x pd A batch of N real data samples. Sample: z = Encθ(x) Latent encoding of real data. Sample: z N(µz, Σz) A batch of N prior distribution samples. Compute: Fourier coeﬃcients αm and βm Compute: Discriminator coeﬃcients γm Compute: Optimal Lagrange multiplier λ FS Evaluate: WGAN-FS loss LG(D FS( z), D FS(z)) Update: Generator Encθ η θ[LG] end for WAEFR training: while Encθ, Decψ not converged do

Sample: x pd A batch of N real samples. Sample: z = Encθ(x) Latent encoding of real samples. Sample: x = Decψ( z) Reconstructed samples. Evaluate: Autoencoder Loss: LAE(x, x) Update: Autoencoder Encθ η θ[LAE]; Decψ η ψ[LAE] Sample: z N(µz, Σz) A batch of N prior distribution samples. Compute: Fourier coeﬃcients αm, βm, and γm Compute: Optimal Lagrange multiplier λ FS Evaluate: WGAN-FS loss LG(D FS( z), D FS(z)) Update: Generator Encθ η θ[LG] end while Output: Reconstructed random prior samples: Decψ(z)

6.2 Experiments on Image Datasets

To illustrate the performance of WAEFR, we consider a learning task on several standard datasets: MNIST (Le Cun et al., 1998), SVHN (Netzer et al., 2011), Celeb A (Liu et al., 2015), CIFAR-10 (Krizhevsky, 2009), and Ukiyo-E faces (Pinkney and Adler, 2020). On CIFAR-10, we consider both multi-class and single-class learning tasks. Experimental setup: The convolutional autoencoder model proposed by Tolstikhin et al. (2018) is employed for both the baseline WAEs and WAEFR. The prior distribution is a 16-D Gaussian for MNIST, and 64-D Gaussian for the other datasets. In WAEFR, the Fourier-series period is set to T = 15, and the latent representation is passed through a linear activation with saturation (clipping) of the latent vector amplitudes beyond [ 10, 10] in order to prevent latching on to an aliased Fourier representation. Based on the analysis presented in Appendix E.2, the Fourier-series summation is truncated with Mlow = 2 and Mhigh = 10 with L = 102 randomly sampled high-frequency terms. While the baseline WAEs uses the

Euler-Lagrange Analysis of GANs

deep convolutional discriminators (Tolstikhin et al., 2018), the WAEFR discriminator is as described in Section 6.1 where the weights are determined single-shot. This facilitates faster training of the encoder compared against an out-of-loop evaluation of D(x). A batch size of 150 is used. The networks are trained using the Adam optimizer (Kingma and Ba, 2015). A learning rate of 2 10 4 is used for all the variants. The models are trained on 2 104 batches for MNIST, 5 104 batches for CIFAR-10, 7 104 batches for SVHN and 105 batches for Celeb A and Ukiyo-E. We consider the following baselines:

The Jensen-Shannon divergence GAN loss, which is equivalent to the base WAE conﬁguration (Tolstikhin et al., 2018).

The KL-divergence based Wasserstein Autoencoder (WAE-KL) (Tolstikhin et al., 2018)

The WGAN loss, corresponding to a Wasserstein adversarial autoencoder (WAAE) with the Lipschitz penalty (Petzka et al., 2018) (WAAE-LP).

The WGAN loss with adversarial Lipschitz penalty (Terj ek, 2020) (WAAE-ALP).

The Cram er-Wold autoencoder (CWAE) (Knop et al., 2020), which computes the Cram er-Wold distance between the latent-space and target Gaussian distributions.

The autoencoder loss is the mean-squared error in all the cases. Additionally, for all WAE baselines, the discriminator is updated thrice for every update of the generator. Pre-training the GAN component (Encoder-discriminator pair) for 10 epochs was found to result in faster convergence across all WAE-GAN variants. Celeb A and Ukiyo-E images are resized to 64 64 3. Pixel intensities are rescaled to [ 1, 1] in all experiments.

Evaluation metrics: The WAE variants are evaluated on the following metrics:

The adversarial network s ability to match the latent and prior distributions, in terms of the Fr echet inception distance (FID) (Heusel et al., 2017) evaluated on batches of images decoded from prior sample vectors.

The quality of the autoencoder s reconstructed samples, measured in terms of the average reconstruction error RE on unseen test set images, and deﬁned as follows: RE = 1 N PN i=1 xi xi 2 2, where {xi} are the samples and { xi} are the corresponding reconstructions.

The continuity of the latent space, demonstrated visually by decoding the interpolated points between the latent representations of two target data samples.

The sharpness of the decoded images measured using the variance of the Laplacian of the image as proposed by Tolstikhin et al. (2018).

Additional details on the computation of these metrics are included in Appendix E.4. We analyze the FID and average reconstruction error as a function of the iterations. To demonstrate the continuity of the latent space, we linearly interpolate between the latent representations of the test set images and present the decoded interpolated images. We tabulate FID scores and RE for the converged models. For the case of single-class learning on CIFAR-10, all metrics are averaged across results obtained from training the models on

Asokan and Seelamantula

each of the ten classes, while images are presented for the Boat class. For WAEFR, we also plot λ FS as a function of iterations to quantify the convergence of pdℓto pℓ. Results: Figure 7 presents the generator loss LG and the optimal Lagrange multiplier λ FS as a function of iterations when training WAEFR on each of the datasets considered. We observe that, in each case, λ FS converges in less than 200 iterations. This indicates that the GAN component of WAEFR converges early in the training, with the latent space of the generator taking the form of the desired prior, while subsequent training improves the accuracy of the autoencoder s mapping from the latent space to the target images. Figure 8 presents reconstructed samples of test images for all the variants under consideration. We observe that the reconstructed image quality of WAEFR is on par with that of the baseline approaches. From the RE versus iterations plots shown in Figure 9, we observe that the reconstruction error of WAEFR after convergence is lower than that of the baselines on all datasets. Further, RE of WAEFR converges more smoothly compared with the baselines. The jitter in case of the baseline models may be attributed to the switching between the GAN and the autoencoder components of the WAE. On the other hand, since WAEFR considers a closed-form evaluation of the discriminator, the convergence behavior of the GAN, and consequently, the autoencoder, is smoother and superior. Figure 10 shows images generated by decoding randomly drawn samples from the prior distribution. While WAEFR generates images of visually comparable quality, the samples from Celeb A are sharper and more diverse than the baseline models. All the variants generated more realistic images on single-class learning than on the multi-class task in CIFAR-10. Figure 11 shows the convergence of the FID score as the iterations progress. WAEFR outperforms the baselines by a signiﬁcant margin when the latent space dimension is small, as in the case of MNIST, or when training with limited data, such as the Ukiyo-E dataset. WAEFR is on par with WAAE-LP and CWAE baselines on high-dimensional data. Figures 12 17 show the images obtained by decoding interpolated points in the latent space. The ﬁrst and last images in each case depict the ground truth reference images. The interpolations in WAEFR are on par with the baselines on MNIST. On the SVHN, Celeb A, and Ukiyo-E datasets, WAEFR generates sharper images than the baselines. In the case of multi-class CIFAR-10, all variants failed to generate a realistic interpolation. This may be attributed to the large inter-class diversity in CIFAR-10. Table 2 presents the best-case FID scores, RE values, and the sharpness metric of the various models. Sharpness is evaluated in two scenarios: images obtained by decoding the latent-space interpolation between images; and the decoded samples drawn randomly following a prior distribution. WAEFR outperforms the baselines in terms of FID on Celeb A, SVHN, and Ukiyo-E datasets, while achieving comparable performance on MNIST and CIFAR-10. In all the cases, WAEFR achieved the lowest reconstruction error. WAEFR also achieves up to two-fold improvement in image sharpness on face image datasets such as Celeb A and Ukiyo-E in comparison with the baselines.

Euler-Lagrange Analysis of GANs

100 200 300 400 ITERATIONS

LAGRANGIAN MULTIPLIER: λ F S

200 400 600 800 ITERATIONS

LAGRANGIAN MULTIPLIER: λ F S

Celeb A Ukiyo-E

200 400 600 800 ITERATIONS

LAGRANGIAN MULTIPLIER: λ F S

200 400 600 800 1000 1200 ITERATIONS

LAGRANGIAN MULTIPLIER: λ F S

CIFAR-10 (Single class) CIFAR-10

200 400 600 800 ITERATIONS

LAGRANGIAN MULTIPLIER: λ F S

200 400 600 800 ITERATIONS

LAGRANGIAN MULTIPLIER: λ F S

Figure 7: (

Color online) Convergence of the optimal Lagrange multiplier λ FS and generator loss LG versus iterations for WAEFR trained on various datasets. The Lagrange multiplier is a measure of how quickly and stably the GAN component of the WAE converges. We observe that in all cases, the latent-space distribution matching is achieved by the GAN component in fewer than 200 iterations.

Asokan and Seelamantula

Figure 8: (

Color online) Reconstructed images from WAE, WAE-KL, WAAE-LP, WAAEALP, CWAE and WAEFR. While WAEFR generates sharper and more detailed reconstructions on face image datasets such as Celeb A or natural image datasets such as CIFAR-10, its performance on the other datasets is on par with the baselines.

Euler-Lagrange Analysis of GANs

5000 10000 15000 ITERATIONS

20000 40000 60000 ITERATIONS

Celeb A Ukiyo-E

20000 40000 60000 80000

20000 40000 60000 80000

CIFAR-10 (Single class) CIFAR-10

10000 20000 30000 40000

10000 20000 30000 40000 ITERATIONS

WAE WAE-KL WAAE-LP WAAE-ALP CWAE WAEFR

Figure 9: (

Color online) Average reconstruction error RE versus iterations for various WAE GAN approaches considered. WAEFR converges to a lower RE in all the cases considered and its convergence is also smoother than the baseline variants.

Asokan and Seelamantula

Figure 10: (

Color online) Images generated by WAE, WAE-KL, WAAE-LP, WAAEALP, CWAE and WAEFR by decoding random samples drawn from the prior distribution. WAEFR generates images of comparable quality on MNIST and CIFAR-10, while producing more diverse and sharper images on the SVHN, Celeb A, and Ukiyo-E datasets.

Euler-Lagrange Analysis of GANs

5000 10000 15000 ITERATIONS

20000 40000 60000 ITERATIONS

Celeb A Ukiyo-E

20000 40000 60000 80000 ITERATIONS

50000 100000 150000 ITERATIONS

CIFAR-10 (Single class) CIFAR-10

10000 20000 30000 40000 ITERATIONS

10000 20000 30000 40000 ITERATIONS

WAE WAE-KL WAAE-LP WAAE-ALP CWAE WAEFR

Figure 11: (

Color online) FID vs. iterations for the various WAE GAN ﬂavors considered, when evaluated on images generated by decoding randomly drawn samples following a prior distribution. WAEFR exhibits faster convergence on lower-dimensional latent-space representations (as in the case of MNIST) and comparable convergence for the higher-dimensional ones (the remaining datasets).

Asokan and Seelamantula

Figure 12: Images generated by decoding interpolated points between the latent space representations of two test images from the MNIST dataset. WAEFR interpolation gives rise to sharper images and superior convergence to the target image.

Euler-Lagrange Analysis of GANs

Figure 13: (

Color online) Images generated by decoding interpolated points between the latent space representations of two validation set images on SVHN. Interpolations in WAEFR are sharper. CWAE fails to learn accurate reconstructions. Reconstructions of interpolated points produce visually sharper images in WAEFR.

Asokan and Seelamantula

Figure 14: (

Color online) Interpolation on the Celeb A dataset. The images generated by WAEFR are sharper, preserve more details, and are closer to the ground truth indicating that the representations learnt by WAEFR are more accurate.

Euler-Lagrange Analysis of GANs

Figure 15: (

Color online) Images generated by decoding interpolated points between the latent space representations of two validation set images from the Ukiyo-E faces dataset. WAEFR results in much sharper images than the baselines.

Asokan and Seelamantula

Figure 16: (

Color online) Images generated by decoding interpolated points between the latent representations of two test set images from the Boat class of CIFAR-10. The performance of WAEFR is comparable to that of the baseline models.

Euler-Lagrange Analysis of GANs

Figure 17: (

Color online) Interpolation across image classes from CIFAR-10 dataset. In all the cases, the interpolated images are unrealistic, which indicates that the interclass variability is too high to produce a semantically meaningful interpolation.

Asokan and Seelamantula

GAN ﬂavor MNIST SVHN Celeb A Ukiyo-E CIFAR-10 (Averaged)

WAE 21.676 46.083 42.943 204.446 124.165 123.88

WAE-KL 26.231 59.717 59.223 215.013 112.650 115.96

WAAE-LP 20.240 47.332 43.509 195.133 108.512 108.95

WAAE-ALP 22.306 48.128 45.628 200.330 107.509 110.223

CWAE 22.125 46.757 47.963 207.350 114.689 102.062

WAEFR 19.793 44.811 37.195 192.049 108.804 100.754

WAE 0.0827 0.0425 0.0939 0.0520 0.1786 0.125

WAE-KL 0.0707 0.0380 0.0776 0.0421 0.1254 0.116

WAAE-LP 0.0747 0.0353 0.0868 0.0429 0.1382 0.117

WAAE-ALP 0.0836 0.0377 0.0956 0.0479 0.1294 0.119

CWAE 0.0735 0.0478 0.0852 0.0831 0.1729 0.112

WAEFR 0.0693 0.0310 0.0762 0.0417 0.1227 0.107

WAE 0.1567 0.0018 0.0015 0.1210 0.0625 0.0011

WAE-KL 0.1317 0.0014 0.0018 0.1255 0.0039 0.0032

WAAE-LP 0.1520 0.0017 0.0044 0.1566 0.0155 0.0029

WAAE-ALP 0.1609 0.0017 0.0035 0.1441 0.0150 0.0039

CWAE 0.1703 0.0019 0.0036 0.0821 0.0158 0.0086

WAEFR 0.1717 0.0028 0.0084 0.2275 0.0194 0.0110

Interpolation

WAE 0.1681 0.0022 0.0032 0.0270 0.0035 0.0027

WAE-KL 0.1629 0.0022 0.0044 0.0229 0.0053 0.0054

WAAE-LP 0.1706 0.0024 0.0044 0.0383 0.0036 0.0041

WAAE-ALP 0.1031 0.0019 0.0038 0.0345 0.0061 0.0031

CWAE 0.1387 0.0028 0.0034 0.0136 0.0061 0.0045

WAEFR 0.1746 0.0029 0.0077 0.0330 0.0068 0.0064

Benchmark 0.1950 0.0051 0.0318 0.1805 0.0278 0.0361

Table 2: A comparison of various GAN ﬂavors in terms of the performance metrics across several standard datasets. The best values are highlighted in boldface. CIFAR-10 (Averaged) corresponds to the metric computed per class followed by averaging across classes. The FID and RE scores (lower the better, indicated by ) are the best for WAEFR for almost all datasets. Sharpness (Random) is the sharpness computed based on random samples drawn from the prior distribution, whereas Sharpness (Interpolation) is the sharpness computed based on the interpolated images. Sharpness (Benchmark) is the sharpness value obtained over the entire dataset. Closer the sharpness to the benchmark, better is the model. We observe that WAEFR achieves almost two-fold improvement in sharpness values over the best-case baselines on Celeb A.

Euler-Lagrange Analysis of GANs

7. Conclusion and Discussion

In this paper, we considered the functional optimization of standard GAN losses in a variational setting, by enforcing the Euler-Lagrange (EL) conditions to determine the optimum. While this approach subsumes point-wise optimization when the integral costs do not contain terms involving the gradients of the discriminator D or generator distribution pg, the EL conditions become indispensable when gradient penalties or constraints are enforced. To truly appreciate the importance of Euler-Lagrange analysis, we considered the Wasserstein GAN subjected to a novel variant of the gradient-norm penalty (WGAN-GNP). This resulted in the optimal discriminator being the solution to a second-order partial diﬀerential equation (PDE). In principle, solving the PDE obviates the need for learning a neural network based discriminator. We showed analytically, both in the univariate as well as the multivariate settings, that the WGAN-GNP formulation results in the optimal generator distribution that matches with the desired data distribution. We did so not by assuming pg to be a distribution, but by enforcing it through constraints of point-wise non-negativity and unit area under the curve. In addition, the discriminator PDE allows us to obtain a closed-form expression for the optimal Lagrange multiplier λ d corresponding to the gradient-norm penalty, obviating the need to perform a hyperparameter search for the optimal λd based on empirical evidence.

The PDE connection for the optimal discriminator provides a novel viewpoint for GAN optimization. By employing a Fourier-series approximation, we showed that a single-shot solution can be obtained for the discriminator, given the generator. The solution relies on the estimates of the characteristic functions of the data and generator distributions. The superior performance of this novel approach was demonstrated in the univariate as well as low-dimensional multivariate Gaussian settings. A shortcoming, however, is that the approach does not scale well with the dimensionality of the data. In order to overcome this hurdle, we proposed several approximations in the high-dimensional scenario: Fourierseries model-order truncation, sample estimates of the characteristic function, and random sampling of the high-frequency harmonics. We presented bounds on the approximation error in each of these cases, which brought to light a trade-oﬀbetween the truncation error and sample estimation error. While including higher-order terms improves the quality of the approximation, poorly estimating the high-frequency coeﬃcients due to limited batches of data increases the estimation error. All of these approximations operating in the latent space of the high-dimensional data, as discovered by a Wasserstein autoencoder, resulted in a tractable model, as also demonstrated through experimental results on several standard datasets. It is important to mention that, despite several simpliﬁcations in the Fourier-series model, the proposed single-shot optimal discriminator in WGAN-FS and WAEFR resulted in a performance that is on par with the more sophisticated neural network counterparts, not only achieving faster convergence, but also up to two-fold improvement in the sharpness measure on face image datasets such as Celeb A and Ukiyo-E.

Future Scope: The choice of Fourier bases was motivated by the speciﬁc PDE to be solved, which has a natural connection to harmonic functions, by virtue of the eigenfunction property. Owing to the orthogonality property of the Fourier bases, determining the coeﬃcients also became signiﬁcantly simpler. The Fourier approach serves as a proof of concept, and alternative, and potentially more parsimonious, bases representations (for

Asokan and Seelamantula

instance, wavelets) could also be employed for function approximation. In the context of WGAN-FS, the need for continuously diﬀerentiable distributions gives insights into potential generator architectures. Networks that can approximate the Laplacian of the discriminator function, or those with inﬁnitely diﬀerentiable activation functions are potential directions for further exploration. Analogous to the WGAN-FS model, one could also consider bases expansions in the context of other GANs, for instance, the Stein operator (Oates et al., 2017) in Sobolev GANs. Euler-Lagrange analysis can also be employed to GAN losses that cannot be accommodated within the standard divergence minimization framework. While we presented results for non-saturating SGAN (Appendix A.3) and LSGAN (Appendix A.4), other variants such as the relativistic discriminator based GANs (Jolicoeur-Martineau, 2019) or the cycle consistent GAN (Zhu et al., 2017) could also be analyzed in the Euler-Lagrange framework. The framework would also be applicable in the scenario where the GAN loss includes derivatives of the generator distribution as well. It would also be suitable for several other regularized GAN variants (Kodali et al., 2017; Roth et al., 2017; Mescheder et al., 2018; Arbel et al., 2018) as illustrated in Appendix F.

Acknowledgements

This work is supported by the Microsoft Research Ph.D. Fellowship 2018, Qualcomm Innovation Fellowship 2019, 2021 and 2022, Robert Bosch Center for Cyber-Physical Systems Ph.D. Fellowships (2020-2021; 2021-2022), and the Science and Engineering Research Board (SERB) Core Research Grant (Co RE), Department of Science and Technology.

Git Hub Repository

The source code for the Tensor Flow 2.0 (Abadi et al., 2016) implementations and the comparisons presented in this paper, the pre-trained models, and high-resolution images of the results are available at https://github.com/Darth Sid95/ELF_GANs.

Euler-Lagrange Analysis of GANs

Table of Contents

A Euler-Lagrange Analysis of Divergence Minimizing GANs 48

A.1 f-GANs and the Euler-Lagrange Condition . . . . . . . . . . . . . . . . . 49

A.2 The Optimality of f-GANs . . . . . . . . . . . . . . . . . . . . . . . . . . 52

A.3 Euler-Lagrange Analysis of the Non-saturating SGAN . . . . . . . . . . . 54

A.4 Euler-Lagrange Analysis of the Least-squares GAN . . . . . . . . . . . . . 55

B Optimality of WGAN-GNP (1-D) 57

B.1 Optimal Lagrange Multiplier (1-D) . . . . . . . . . . . . . . . . . . . . . . 57

B.2 Optimal WGAN-GNP Generator (1-D) . . . . . . . . . . . . . . . . . . . 58

B.3 Optimal Lagrange Multiplier in WGAN-FS (1-D) . . . . . . . . . . . . . . 61

C Optimality of WGAN-GNP (n-D) 62

C.1 Optimal Lagrange Multiplier (n-D) . . . . . . . . . . . . . . . . . . . . . . 62

C.2 Optimal WGAN-GNP Generator (n-D) . . . . . . . . . . . . . . . . . . . 64

C.3 Optimal Lagrange Multiplier in WGAN-FS (n-D) . . . . . . . . . . . . . . 67

D Fourier-series Error Analysis 68

D.1 Fourier-series Truncation Error (1-D) . . . . . . . . . . . . . . . . . . . . . 69

D.2 Fourier-series Truncation Error (n-D) . . . . . . . . . . . . . . . . . . . . 70

D.3 Error in the Sample Estimation of Fourier Coeﬃcients . . . . . . . . . . . 74

E Additional Experimentation 78

E.1 Additional Experiments on 1-D and 2-D Gaussians . . . . . . . . . . . . . 78

E.2 Experiments on n-dimensional Gaussians . . . . . . . . . . . . . . . . . . 85

E.3 Image-space Matching with WGAN-FS . . . . . . . . . . . . . . . . . . . . 90

E.4 Additional Details on Evaluation Metrics . . . . . . . . . . . . . . . . . . 91

F Other Gradient-Regularized GANs 92

Asokan and Seelamantula

An Overview of the Appendices

The appendices are structured as follows: Appendix A presents an analysis of divergence minimizing GANs and their variants within the Euler-Lagrange framework. Appendix B provides proofs of theorems in 1-D, while extensions to n-D are presented in Appendix C. We analyze the various sources of error in the Fourier-series approximation in Appendix D. In Appendix E, we present additional experiments on learning multivariate synthetic Gaussians and image-space distributions with WGAN-FS. Appendix F contains the analysis of the Wasserstein GAN subject to the other gradient penalties proposed by Gulrajani et al. (2017) and Mescheder et al. (2018), and the SGAN and LSGAN subjected to the proposed gradient-norm penalty.

Appendix A. Euler-Lagrange Analysis of Divergence Minimizing GANs

We analyze the various divergence minimizing GANs within the variational framework and show that the degenerate Euler-Lagrange condition applies to determine the optimum in these GANs. Table 3 shows the discriminator and generator loss functions of various GANs that fall under this category. The standard GAN (SGAN) proposed by Goodfellow et al. (2014) considers both saturating and non-saturating losses. The term saturation refers to the generator gradients vanishing during training. The vanilla SGAN (employing the saturating loss) results in a min-max zero-sum game where the optimal generator minimizes the Jensen-Shannon divergence between pd and pg. On the contrary, the SGAN with a non-saturating loss (SGAN-NS) does not readily map to a divergence, but is preferred in a neural-network implementation as it provides better gradients at the cost of increased sensitivity to hyperparameters (Fedus et al., 2018). In the least-squares GAN (LSGAN), one minimizes the squared distance between the discriminator prediction and the class labels (a, b, c) for fake, real, and generated samples, respectively, with the optimization objective minimizing the Pearson-χ2 divergence when b c = 1 and b a = 2 (Mao et al., 2017).

Nowozin et al. (2016) considered f-divergences of the form:

Df(pd pg) = Z

X pg(x)f pd(x)

where the divergence function f : R+ R is convex, lower-semicontinuous and satisﬁes f(1) = 0, and X is a suitable domain of integration. Nowozin et al. (2016) demonstrated that, in a GAN setting, the minimization of Df(pd pg) is equivalent to the sequential minimization of the discriminator and generator losses:

Lf D = E x pd [T(x)] + E x pg[fc(T(x))], and (29)

Lf G = E x pd [T (x)] E x pg[fc(T (x))], (30)

with respect to D and pg, respectively, where T(x) is the output of the discriminator subjected to activation g, that is, T(x) = g(D(x)) and T (x) = g(D (x)), where D (x) is the optimal discriminator, and fc is the Fenchel conjugate of f. The choice of the divergence f and the activation g gives rise to GANs that minimize divergences such as Kullback-Leibler (KL), reverse KL, Pearson-χ2, squared Hellinger, Jensen-Shannon, etc.

Euler-Lagrange Analysis of GANs

GAN Discriminator loss LD Generator loss LG

SGAN LS D= Ex pd[ln D(x)] Ex pg [ln(1 D(x))] LS G=Ex pd[ln D (x)]+Ex pg [ln(1 D (x))]

SGAN-NS LNS D = Ex pd[ln D(x)] Ex pg [ln(1 D(x))] LNS G = Ex pd[ln(1 D (x))] Ex pg [ln D (x)]

LSGAN LLS D =Ex pd[(D(x) b)2]+Ex pg [(D(x) a)2] LLS G =Ex pd[(D (x) c)2]+Ex pg [(D (x) c)2]

f-GAN Lf D= Ex pd[g(D(x))]+Ex pg [fc(g(D(x)))] Lf G=Ex pd[g(D (x))] Ex pg [fc(g(D (x)))]

Table 3: A summary of various divergence minimizing GAN losses considered in this paper. The SGAN and f-GAN losses are symmetric and lead to a min-max optimization problem, whereas the LSGAN and SGAN-NS are not symmetric.

A.1 f-GANs and the Euler-Lagrange Condition

We now reformulate f-GANs subject to the non-negativity and integral constraints within the Euler-Lagrange framework. The results obtained are consistent with those available in the literature.

Theorem 7. Optimality of f-GANs: Consider the optimization of the f-GAN losses Lf D and Lf G given in Equations (29) and (30), respectively. Let pg be subject to the integral constraint Ωpg : R

X pg(x)dx = 1, and the non-negativity constraint Φpg : pg(x) 0. The f-GAN optimization is formulated as:

X Ff(x, T, pg) dx , and (31a)

X Ff(x, T , pg) + (λp + µp(x)) pg(x) dx , (31b)

Ff(x, T, pg) = (T(x)pd(x) fc(T(x))pg(x)) , and T (x) = g(D (x)),

D being the optimal discriminator function, and λp and µp(x) being the Karush-Kuhn Tucker (KKT) multipliers, such that µp(x) 0 and µp(x)pg(x) = 0, x X. The integrals are assumed to be well-deﬁned over the support X. The optimal discriminator D (x) for a given pg satisﬁes:

and the optimal generator p g(x), given D (x), satisﬁes:

fc(T ) pg=p g = λ p + µ p(x), (33)

where λ p and µ p(x) are the optimal KKT multipliers.

Asokan and Seelamantula

Proof. The proof proceeds by applying the Euler-Lagrange conditions to the costs given in Equations (31a) and (31b). Consider the f-GAN discriminator and generator losses (cf. Equations (29) and (30)):

Lf D = E x pd [T(x)] + E x pg[fc(T(x))], and Lf G = E x pd [T (x)] E x pg[fc(T (x))],

respectively. Expressing the expectations in integral form gives

X (T(x)pd(x) + fc(T(x))pg(x)) dx = Z

X Ff D(x, T(x)) dx.

As the optimization of Lf D over D(x) does not involve gradient terms, the Euler-Lagrange

condition applies point-wise: Ff D D = 0, which yields pd pg fc(T)

T=T = 0, (34)

T = g(D) T D = g (D),

g being the activation function at the output of the discriminator network. Based on the f-GAN formulations of Nowozin et al. (2016) (cf. Column 2 of Table 4), we observe that g (D) = 0, x such that D(x) = 0. This yields the ﬁrst result of Theorem 7:

The optimal discriminator D (x) is the one that satisﬁes the above equation with the corresponding output function T = g(D ). Given T , the generator optimization is carried out on Lf G, subject to the integral constraint Ωpg and non-negativity constraint Φpg. The Lagrangian of the cost becomes

X (T pd + fc(T )pg + λppg + µp(x)pg) dx, (35)

where λp and µp(x) are the KKT multipliers. Applying the EL condition to Lf G gives: pd pg fc(T )

pg fc(T ) + λp + µp(x) = 0.

Using (34) gives the condition that the optimal generator p g must satisfy:

fc(T ) pg=p g = λp + µp(x).

The feasible KKT multipliers satisfy the integral and non-negativity constraints when enforced on p g: Z

X p g(x) dx = 1, and p g(x) 0, x X.

Euler-Lagrange Analysis of GANs

f-divergence g(D) fc(T) D (x) p g(x) (λ p, µ p)

Kullback-Leibler (KL) D e T 1 1 + ln pd pg

pd(x) λ p+µ p (1, 0)

Reverse KL e D 1 ln( T) ln pd pg

pd(x) eλ p+µ p+1 ( 1, 0)

Pearson-χ2 D 1 4T 2 + T 2 pd pg

λ p+µ p+1 (0, 0)

Squared-Hellinger 1 e D T 1 T

1 2 ln pd pg

pd(x) (λ p+µ p+1)2 ( 2, 0)

SGAN ln e D e D+1

pd(x) eλ p+µ p 1 (ln 2, 0)

Table 4: The optimal discriminator D (x) and generator p g for various f-GANs (Nowozin et al., 2016) given the activation function g and the Fenchel conjugate fc. The optima D (x) and p g are the solutions to Equations (32) and (33), respectively.

In addition, µp(x) 0 and it satisﬁes the complementary slackness condition

µp(x)p g(x) = 0, x X,

which gives µp(x) = 0 whenever p g(x) > 0, and µp(x) 0 whenever p g(x) = 0. For all x such that pd(x) = 0, the generator cost evaluated at the optimal generator distribution Lf G(p g, λp, µp(x)) becomes zero, and subsequently, the choice of µ p(x) over this set is immaterial. Subsequent optimization of Lf G(p g, λp, µp(x)), x such that pd(x) > 0, over the KKT multipliers gives us the optimal multipliers λ p and µ p(x). This yields the second result of Theorem 7:

fc(T ) pg=p g = λ p + µ p(x).

When applied to any f-GAN variant (Nowozin et al., 2016), Theorem 7 yields the optimal discriminator D and optimal generator p g, as listed in Table 4, which are consistent with the results known in f-GAN literature. We provide the proofs in Appendix A.2. Enforcing the complementary slackness condition: µ p(x)p g(x) = 0, x X, in addition to the integral and non-negativity constraints yields the optimal λ p and µ p(x) for each f-GAN. We observe from Table 4 that the optimal generator distribution p g(x) is non-negative for all f-GAN variants. Consequently, the optimal KKT multiplier µ p(x) = 0, x such that pd(x) > 0. The calculations show that it suﬃces to enforce only the integral constraint and the optimal solution automatically satisﬁes the non-negativity constraint. For the f-GAN variants considered, the optimal generator is p g(x) = pd(x), which is indeed the desired solution of the GAN optimization.

Asokan and Seelamantula

A.2 The Optimality of f-GANs

We now consider each of the f-GAN variants proposed by Nowozin et al. (2016) and present the optimal discriminator and generator functions in each case.

KL divergence: As an illustration, consider the f-GAN formulation with Kullback-Leibler divergence, which corresponds to g(D) = D and fc(T) = e T 1. Following Theorem 7, the optimal discriminator and generator are given by:

D (x) = 1 + ln pd(x)

and p g(x) = pd(x) λp + µp(x),

respectively. The support of the solution is restricted to supp(pd). We split the support X into two disjoint sets: X+ = {x | pd(x) > 0} and X0 = {x | pd(x) = 0}. The loss LG(p g, λp, µp(x)) vanishes everywhere outside X+, and hence the optimization is undeﬁned over X0. Within X+, enforcing the complementary slackness condition on µp gives:

µ p(x) = 0, x such that pd(x) > 0.

Enforcing the integral constraint yields λ p = 1 as the only feasible solution. This gives us the optimal generator distribution:

p g(x) = pd(x), x X.

In summary, the optimal generator perfectly agrees with the data distribution.

Reverse-KL divergence: The EL analysis of the reverse-KL divergence based f-GAN closely follows the analysis for the KL-divergence based GAN. We have g(D) = e D and fc(T) = 1 ln( T). Following Theorem 7, we obtain the optimal discriminator and generator functions as

D (x) = ln pd(x)

and p g(x) = pd(x) eλp+µp(x)+1 ,

respectively. Enforcing the complementary slackness condition µp(x)p g(x) = 0, we obtain

the condition: µp(x) eλp+µp(x)+1 = 0, x such that pd(x) > 0. In conjunction with µp(x) 0, we obtain µ p(x) = 0 as the only feasible solution. As in the case of the KL-divergence f-GAN, the EL analysis is applicable only when the integrand in the cost is non-zero, that is, pg(x) > 0. Enforcing the integral constraint results in λp = 1 as the only solution. This gives us the optimal KKT multipliers µ p(x) = 0, x X+, and λ p = 1. The corresponding optimal generator distribution is p g(x) = pd(x).

Jensen-Shannon divergence: In this case, the f-GAN formulation considers g(D) = ln(2) ln 1 + e D and fc(T) = ln 2 e T . From Theorem 7, we have

D (x) = ln pd(x)

and p g(x) = pd(x) 2e(λp+µp(x)) 1.

Euler-Lagrange Analysis of GANs

Enforcing the integral, non-negativity, and complementary slackness constraints, we obtain the only feasible (and therefore optimal) set of KKT multipliers: µ p(x) = 0, x X+ and λ p = 0. Since both KKT multipliers are zero, it can be veriﬁed that unconstrained optimization over pg also yields the same solution. Therefore, for this choice of divergence, the constraints are automatically satisﬁed.

SGAN divergence: The f-divergence of the SGAN is closely related to the Jensen-Shannon divergence, but for the ln(2) term (Goodfellow et al., 2014). The SGAN f-divergence formulation considers g(D) = ln 1 + e D and fc(T) = ln 1 e T . Applying Theorem 7 gives

D (x) = ln pd(x)

, and p g(x) = pd(x) e(λp+µp(x)) 1. (36)

The optimal output function corresponding to D (x) is given by

T (x) = g(D (x)) = ln pd(x) pd(x) + pg(x)

= ln (D SGAN(x)) ,

where D SGAN is the optimal discriminator corresponding to the SGAN formulation proposed by Goodfellow et al. (2014). As in the KL and reverse KL divergence based f GANs, the only feasible KKT multiplier associated with the non-negativity constraint is µ p(x) = 0, x such that pd(x) > 0, and the associated multiplier for the equality constraint is given by λp = ln(2). In summary, the optimal KKT multipliers are: µ p(x) = 0, x X+, and λ p = ln(2).

Pearson-χ2 divergence: The Pearson-χ2 divergence corresponds a special case of the LSGAN loss. The f-GAN formulation considers g(D) = D and fc(T) = 1

4T 2 + T. The associated optimal discriminator and generator functions, derived following Theorem 7, are given by

D (x) = 2 pd(x) pg(x)

and p g(x) = pd(x) p

λp + µp(x) + 1 ,

respectively. Applying the integral constraint Ωpg, we obtain λp + µp(x) = 0. In addition, enforcing the non-negativity and complementary slackness conditions yields µp(x) = 0, x, such that pd(x) > 0, as the only feasible solution. Similar to the Jensen-Shannon divergence case, we obtain the degenerate case of KKT multipliers: µ p(x) = 0, x X+ and λ p = 0, corresponding to the solution obtained in the unconstrained optimization problem.

Squared Hellinger divergence: The f-GAN associated with the squared Hellinger divergence has g(D) = 1 e D and fc(T) = T 1 T . Applying Theorem 7, we obtain

and p g(x) = pd(x) (λp + µp(x) + 1)2 , (37)

respectively. Enforcing the integral constraints, we obtain the conditions:

λp + µp(x) = 0 or λp + µp(x) = 2.

Asokan and Seelamantula

The optimal KKT multiplier corresponding to the inequality constraint, as in the previous cases, is µ p(x) = 0, x X+. The corresponding feasible set for λp is { 2, 0}. To ﬁnd the optimal λp, we consider the dual optimization problem associated with only the integral constraint:

λ p = arg max λp { 2, 0}

g(λp) = inf pg LG(pg, D ) ,

= arg max λp { 2,0} LG(p g, D ).

Substituting for D (x) and p g(x) from Equation (37), considering µ p(x) = 0, we obtain:

λ p = arg max λp { 2, 0}

λ2 p λp + 1

By inspection, λ p = 2. Hence, the optimal KKT multipliers are µ p(x) = 0, x X+ and λ p = 2. In addition to the f-divergence based GAN losses considered above, there exist two closely related variants: (i) The non-saturating SGAN (SGAN-NS) (Goodfellow et al., 2014) that alleviates the vanishing gradient problem in a practical GAN setting, and (ii) The general least-squares GAN (Mao et al., 2017) setting with class labels (a, b, c).

A.3 Euler-Lagrange Analysis of the Non-saturating SGAN

The non-saturating SGAN (SGAN-NS) proposed by Goodfellow et al. (2014) is a practical alternative to the SGAN loss, which ensures that the gradients do not vanish while training the generator and discriminator networks. However, it does not ﬁt within the framework of divergence based GANs. Therefore, a straightforward divergence minimization argument does not apply, nevertheless a variational analysis can be carried out. Since the discriminator losses in both SGAN and SGAN-NS are the same, the optimal discriminator given by D (x) = pd(x) pd(x) + pg(x) remains unchanged. Given the optimal discriminator, the Lagrangian of the

generator loss, taking into account the integral constraint Ωpg, is given as follows:

LSGAN NS G = Ex pg[ln(D (x))] Ex pd[ln (1 ln(D (x))) + λp

X pg(x) dx 1 ,

X ( pg(x) ln(D (x)) pd(x) ln (1 D (x)) + λppg(x)) dx λp.

Applying the EL condition yields:

ln(D (x)) + D (x) 1 D (x)

pg=p g = λp + 1, (38)

pd(x) pd(x) + p g(x)

= eλp+1. (39)

The optimal generator distribution p g(x) is the one that solves the above transcendental equation. While no closed-form approaches exist for solving Equation (39), one could solve it approximately.

Euler-Lagrange Analysis of GANs

A second alternative suggested by Goodfellow et al. (2014) involves the removal of the expectation term associated with pd in LSGAN NS G , since this term does not contribute toward the training of the generator network in practice. Incorporating this modiﬁcation gives us the following Lagrangian of the generator loss:

LSGAN NS G = Ex pg[ln(D (x))] + λp

X pg(x) dx 1 ,

X pg(x) ln(D (x)) + λp dx λp.

Applying the EL condition gives the following transcendental equation:

D (x)e D (x) pg=p g = eλp+1, (40)

pd(x) pd(x) + p g(x)

exp pd(x) pd(x) + p g(x)

= eλp+1. (41)

Equation (41) can be solved through the principal branch of the Lambert-W function W0( ) (Lambert, 1758; Corless et al., 1996; Bateman and Erdelyi, 1953), where the equation yey = z for y, z R has a solution, if and only if z 1/e. For z 0, the solution is unique and is given by y = W0(z). Noting that the right-hand size of Equation (41) is always non-negative, we write:

pd(x) pd(x) + p g(x) = W0 eλp+1

p g(x) = W0 eλp+1

1 W0 eλp+1 pd(x). (42)

The optimal generator p g(x) = pd(x) requires W0 eλp+1 = 0.5, which is achieved when λp 1.1935. A similar link with the Lambert-W function was observed in the context analyzing SGAN-NS with inﬁnite-width discriminators employing the neural tangent kernel (Franceschi et al., 2021) In existing implementations, the integral constraint on pg is not imposed explicitly. Instead, the generator network in trained to learn the inversion mapping implicitly. It was observed that, in practice, while SGAN-NS alleviates the problem of vanishing gradients, the training procedure is more sensitive to hyper-parameter tuning than the training of SGAN (Fedus et al., 2018). We attribute the sensitivity to the implicit inversion of the transcendental equations (38) or (40), in comparison with the linear mapping present in the saturating SGAN variant (cf. Equation (36)).

A.4 Euler-Lagrange Analysis of the Least-squares GAN

As a generalization to the Pearson-χ2 divergence based f-GAN, we consider the least-squares GAN formulation presented by Mao et al. (2017). Consider the LSGAN discriminator and generator costs given by

LLSGAN D = 1

2 Ex pd[(D(x) b)2] + 1

2 Ex pg[(D(x) a)2], and

LLSGAN G = 1

2 Ex pd[(D (x) c)2] + 1

2 Ex pg[(D (x) c)2] + λp

X pg(x) dx 1 ,

Asokan and Seelamantula

respectively, where D (x) is the minimizer of LLSGAN D . The discriminator learns a regression model onto the target labels a and b for the generated and true data samples, respectively. The generator learns to output images that are classiﬁed by the discriminator with target label c. The generator loss is subjected to only the integral constraint, as the analysis in Appendix A.2 shows that the non-negativity constraint is met automatically. To optimize the discriminator loss, we consider LLSGAN D in its integral form:

LLSGAN D = 1

(D(x) b)2pd(x) + (D(x) a)2pg(x) dx.

Applying the EL conditions yields the optimal discriminator function given a generator distribution:

D (x) = a pg(x) + b pd(x)

pg(x) + pd(x) .

Given the optimal discriminator, consider the generator loss:

LLSGAN G = 1

(D (x) c)2 (pd(x) + pg(x)) + λppg(x) dx λp.

Applying the EL condition gives

(D (x) c)2 + 2(D (x) c)(pg(x) + pd(x)) D (x)

pg + λp = 0.

Algebraic simpliﬁcation results in the following quadratic in pg:

p2 g(x) (a c)2 + λp + 2pg(x)pd(x) (a c)2 + λp + p2 d(x) ((b c)(2a b c) + λp) = 0.

Solving for the optimal generator distribution, we obtain:

p g(x) = pd(x) pd(x)

(a c)2 + λp .

Only the positive root yields a valid solution. Applying the integral constraint gives the optimal Lagrange multiplier

λ p = (a b)2

The choice of b c = 1 and b a = 2 proposed by Mao et al. (2017) gives λ p = 1, and the optimal generator p g(x) = pd(x). Similarly, setting b a = 2 and a c = 1 yields λ p = 0, and the corresponding GAN loss minimizes the Pearson-χ2 divergence.

While in theory, inﬁnitely many sets of (a, b, c) solve for the desired optimum p g(x) = pd(x), those labels that correspond to the Pearson-χ2 loss are preferred as the corresponding p g automatically satisﬁes the integral constraint without requiring an explicit penalty term.

Euler-Lagrange Analysis of GANs

Appendix B. Optimality of WGAN-GNP (1-D)

In this appendix, we present the proofs for the optimal WGAN-GNP Lagrange multiplier and generator function in the 1-D setting.

B.1 Optimal Lagrange Multiplier (1-D)

Consider the optimal discriminator function in 1-D given by

D (x) = 1 2λd (φ (pg pd)) (x) + a1x + a0.

Its derivative is given by

x y |x y| (pg(y) pd(y)) dy + a1

= 1 4λd (sgn (pg pd))(x) + a1,

where sgn(x) = x |x| denotes the signum function. Enforcing the constraint that the gradient must have unit norm, we get the optimal Lagrange multiplier in 1-D:

|x| (pg pd) (x) + a1

To analyze the second-order condition, consider the WGAN integral cost:

L D(x), D (x) = Z

X F x, D(x), D (x) dx.

The Legendre-Clebsch condition states that a minimizer must satisfy the second-order partial-diﬀerential condition

D 2 0, x X.

Consider the integrand F:

F x, D, D = D(x)(pg(x) pd(x)) + λd |D (x)|2 1 ,

FD D = 2λd,

which implies that a positive value for λd results in a minimizer. In summary, considering the positive square-root in the expression for the optimal Lagrange multiplier in (43) results in a minimizer of LD. We now analyze λ d as a function of pd and pg. First, consider the convolution term inside the integral of Equation (43):

(sgn (pg pd)) (x) = Z

(pg(y) pd(y)) sgn(x y)dy

y= (pg(y) pd(y)) dy Z

y=x (pg(y) pd(y)) dy

= Fpg(x) Fpd(x) 1 Fpg(x) (1 Fpd(x))

= 2 Fpg(x) Fpd(x) ,

Asokan and Seelamantula

where Fpg and Fpd denote the cumulative density functions (CDFs) of pg and pd, respectively. Substituting the above into Equation (43), we get:

2 Fpg(x) Fpd(x) + a1 2 dx.

In a practical setting, as the training of the GAN progresses, the Fourier coeﬃcients of pg(x) converge to those of pd(x). As the distributions become closer to one another in the L2 sense, their diﬀerence, and therefore the diﬀerence between their CDFs also reduces. Therefore, upon convergence of the GAN, ideally, we have αm = βm m Z, and consequently p g(x) = pd(x), yielding

X a2 1 dx = |a1|

As observed in Theorem 2, the convergence of pg to pd is independent of the choice of a1, which is part of the homogeneous solution. Therefore, without loss of generality, we set a1 = 0, which gives us the favorable property that, λ d 0 as pg(x) pd(x), which can be used as a proxy for tracking the convergence of the GAN training.

B.2 Optimal WGAN-GNP Generator (1-D)

Consider the Lagrangian of the generator cost in 1-D:

X ((pd(x) pg(x))D (x) + (λp + µp(x))pg(x)) dx λp,

with D (x) given by Equation (19):

D (x) = 1 2λ d

X φ(x y) (pd(y) pg(y)) dy + a1 x + a0,

where X is the convex hull containing the supports of pd and pg. Without loss of generality, we consider the symmetric fundamental solution φ(x) = 1

2|x| + b0. Since the integrand involves a convolution integral, it is not in the standard form considered in Equation (1). Hence, the EL conditions cannot be applied directly. Starting from ﬁrst principles, we evaluate the ﬁrst variation of LG and set it to zero to obtain the optimizer. Consider an ϵ-perturbation of the loss LG about the optimal generator p g, denoted by LG,ϵ(pg) = LG(p g(x) + ϵη(x)), where η(x) is a family of compactly supported, absolutely integrable, inﬁnitely diﬀerentiable functions that are identically zero at the boundaries of X. The corresponding perturbed discriminator is represented by D ϵ(x). Consider

LG,ϵ(pg) = Z

D ϵ(x)(pd(x)) p g(x) ϵη(x)) + (λp + µp(x))(p g(x) + ϵη(x)) dx,

X φ(x y) p g(y) + ϵη(y) pd(y)

dy + a1 x + a0

= φ (p g + ϵη pd) (x) + a1 x + a0.

Euler-Lagrange Analysis of GANs

Consider the following derivatives with respect to ϵ:

d D ϵ(x) dϵ (pd(x)) p g(x) ϵη(x)) D ϵη(x) + (λp + µp(x))η(x) dx, and

d D ϵ(x) dϵ = 1 2λ d

X φ(x y)η(y) dy.

Substituting for D ϵ(x) and its derivative into d LG,ϵ

dϵ , and evaluating it at ϵ = 0 gives the ﬁrst variation in LG:

X φ(x y)η(y)(p g(x) pd(x)) dy dx

2αd(λp + µp(x)) + a1x φ (pd p g) (x) η(x) dx,

where αd = λ d 4 . Next, consider

X φ(x y)η(y)(p g(x) pd(x)) dy dx.

Considering a compact domain of integration and the functions φ, p g, and pd to be absolutely integrable over the domain, we apply Fubini s theorem and interchange the order of integration to obtain

p g(x) pd(x) φ(x y) dx dy.

Since φ is symmetric, T1 simpliﬁes to

X η(y)(φ (p g pd))(y) dy.

Substituting for T1 in the ﬁrst variation yields

2αd(λp + µp(x)) + a1x 2 φ (pd p g) (x) η(x) dx.

Setting LG to zero and invoking the fundamental lemma of calculus of variations (cf. Section 2) gives rise to the following condition

φ (pd p g) (x) = αd(λp + µp(x)) + 1

which the optimal generator p g must satisfy. Rearranging (44) yields

αdµp(x) = φ (pd p g) (x) αdλp 1

Asokan and Seelamantula

from which we obtain the Laplacian:

αdµ p(x) = pd(x) p g(x),

which gives the optimal generator distribution in terms of the KKT multiplier µp(x):

p g(x) = pd(x) αdµ p(x). (45)

An alternative approach using Fourier analysis: The Fourier transform proves to be an eﬃcient tool to arrive at Equation (45) starting from Equation (44). Recall φ(x) = 1

2|x|. Its Fourier transform cannot be deﬁned in the L1 or L2 sense, but only in the sense of distributions: F{φ(x)} = 1

ω2 , ω R {0}. The Fourier transform of x is also given in the distributional sense as F{x} = jδ (ω), where δ (ω) is the derivative of the Dirac delta, and must be understood in a distributional sense by its action on a test function f C (R): δ (ω), f(ω) = f (0). Writing (44) in the Fourier domain gives

1 ω2 ˆp g(ω) ˆpd(ω) = αdλpδ(ω) + αdˆµp(ω) ja1

ˆp g(ω) = ˆpd(ω) + αd λpω2δ(ω) + ω2ˆµp(ω) ja1

where ˆp g(ω) and ˆpd(ω) are the Fourier transforms of p g(x) and pd(x), respectively. From the properties of the Dirac delta, it follows that ω2δ(ω) = 0 and ω2δ (ω) = 0. Hence, we obtain

ˆp g(ω) = ˆpd(ω) + αdω2ˆµp(ω). (46)

Invoking the diﬀerentiation property of the Fourier transform, we have

µ p(x) F ω2ˆµp(ω).

Taking the inverse Fourier transform on both sides of Equation (46) gives

p g(x) = pd(x) αdµ p(x),

which is identical to Equation (45). We observe that p g is independent of λp, while the optimal µ p(x) must be determined, which can be done by enforcing the integral constraint: Z

X p g(x) dx = Z

pd(x) + αdµ p(x) dx = 1

X µ p(x) dx = 0. (47)

Recall from Theorem 2 that µp(x) 0, x X, in order to satisfy the non-negativity constraint. Let us now split X into two disjoint sets:

X0 = {x | pd(x) = 0}, and X+ = {x | pd(x) > 0}.

Consider the complementary slackness condition:

µp(x)p g(x) = µp(x)pd(x) αdµp(x)µ p(x) = 0, x X.

Euler-Lagrange Analysis of GANs

For x X+, we have either µp(x) = 0 or µ p(x) = pd

αd as feasible solutions, but in view

of the condition in Equation (47), the latter becomes invalid. Therefore, for x X+, µp(x) = 0 is the only solution. For x X0, the complementary slackness condition requires that µp(x)µ p(x) = 0. Consequently, either µp(x) = 0 or µp(x) = c1 x + c0. The former solution implies that µp(x) = 0, x X. If we consider the latter and enforce that µp(x) 0, x X, then c1 is necessarily zero, while c0 is a ﬁnite negative value. Consolidating, we get the following optimal solutions for the Lagrangian parameter:

µ p(x) = 0, x X, or µ p(x) =

( 0, x X+, and

< c0 0, x X0. (48)

Both are equally good optima as far as satisfying the constraints µp(x) 0 and Equation (47) go, because in both cases, the loss function LG evaluates to zero, rendering the speciﬁc choice of c0 irrelevant. Without loss of optimality, we set µ p(x) = 0, x X. The preceding analysis strongly suggests that p g(x) = pd(x), x X is the optimal solution. From Equation (45), we observe that the optimality of p g does not depend on the value of λp. For the loss LG to be ﬁnite, λp must, however, be a ﬁnite real number. Further, the optimality of the generator is unaﬀected by the homogeneous component of the optimal discriminator. In summary, the optima are given as follows:

p g(x) = pd(x), µ p(x) = 0, x X, and λp ( , ).

B.3 Optimal Lagrange Multiplier in WGAN-FS (1-D)

Consider the truncated Fourier-series approximation for the discriminator

D FS(x) = 1 λ FS

γr m cos(ωomx) + γi m sin(ωomx) !

In order to enforce the gradient-norm penalty, we require the square of the derivative:

2 = 1 λ 2 FS

γr mωom sin(ωomx) + γi mωom cos(ωomx) !2

By Cauchy-Schwartz inequality, we have

i=1 u2 i . Using this inequality

allows us to place the following bound:

m=1 ω2 om2 γr2 m sin2(ωomx) + γi2 m cos2(ωomx) !

τ i m + τ r m + τ i m τ r m cos(2ωomx) !

Asokan and Seelamantula

2(γr m)2ω2 om2, and τ i m = 1

2(γi m)2ω2 om2.

Enforcing the gradient-norm penalty gives

τ i m + τ r m + τ i m τ r m cos(2ωomx) !

Simplifying, we obtain:

λ 2 FS (2M + 1)

τ i m + τ r m + τ i m τ r m cos(2ωomx) !

τ i m + τ r m !

τ i m τ r m Z

X cos(2ωomx) dx .

In practice, we have data, and computing a sample estimate of the upper bound over a batch of size N gives the following

v u u t(2M + 1)

m=1 (τ im + τ rm) + 1

m=1 (τ im τ rm) cos(2ωomxk)

Recall that a1 = 1 for the optimal discriminator to satisfy the gradient-norm penalty when p g = pd (cf. Section 3.2). We observed that, in practice, the contribution of a1 is negligible and the upper bound could be used for λ FS.

Appendix C. Optimality of WGAN-GNP (n-D)

In this appendix, we present the multivariate counterparts of the 1-D proofs presented in Appendix B.

C.1 Optimal Lagrange Multiplier (n-D)

Consider the optimal WGAN discriminator in n-D (n 3):

1 x y n 2 (pg(y) pd(y)) dy + a, x + constant,

where a = [a1, a2, . . . , an]T, κn = (n(n 2)v(n)) 1, and X is the convex hull of the supports of pd and pg, where in turn, v(n) is the volume of the unit sphere in Rn given by v(n) = π n 2 Γ n

2 + 1 1. Consider the partial derivative with respect to xi, the ith element of x:

xi = 1 2nv(n)λd

xi yi x y n (pg(y) pd(y)) dy + ai.

Euler-Lagrange Analysis of GANs

Squaring and summing over all i gives

D 2 2 = 1 4n2v(n)2λ2 d

xi yi x y n (pg(y) pd(y)) dy + ai

Enforcing the gradient-norm penalty: R

X D 2 2 1 dx = 0 gives us the following condition on the optimal Lagrange multiplier λ d for a given optimal discriminator:

v u u t 1 4n2v(n)2|X|

xi yi x y n (pg(y) pd(y)) dy + ai

where |X| = R

X 1 dx denotes the volume of X. We next have to determine the appropriate sign of λ d, which is obtained by considering the second-order necessary conditions for optimality. Consider the n-D cost:

LD D(x), {D i(x)}n i=1 = Z

X F x, D(x), {D i(x)}n i=1 dx,

where D i(x) = D

xi . Recall that the Legendre-Clebsch condition (cf. Section 2) in the multidimensional case translates to positive-deﬁniteness of the Hessian matrix H of the Hamiltonian H of the cost LD computed with respect to {D i}n i=1, evaluated at λd = λ d, D(x) = D (x):

λd=λ d; D(x)=D (x) 0,

where denotes positive-deﬁniteness. The entries of HD,H are given by

[HD,H]i,j = 2H D i D j , where the Hamiltonian H =

Considering the integrand F of the WGAN-GNP cost given in Equation (13), the Hamiltonian turns out to be

D(x)(pg(x) pd(x)).

Evaluating the Hessian with respect to D i yields the following:

[HD ,H]i,j =

( 2λd, for i = j, and

0, for i = j.

λd=λ d; D(x)=D (x) = 2 λ d In,

where In is the n n identity matrix. This condition is analogous to the 1-D case, where, picking the positive square-root for λ d in Equation (49) results in D (x) being a minimizer of the chosen cost.

Asokan and Seelamantula

C.2 Optimal WGAN-GNP Generator (n-D)

The derivation of the optimal generator p g proceeds along the lines of the ﬁrst-variation analysis in 1-D, taking into account the fact that the generator cost does not involve terms containing the derivatives of pg. Consider the Lagrangian

X ((pd(x) pg(x))D (x) + (λp + µp(x))pg(x)) dx λp,

where D (x) is as given in Equation (19):

X φ(x y) (pd(y) pg(y)) dy + a, x + constant,

with φ(x) = κn x 2 n, where x Rn, n 3, κn = 1 n(n 2)v(n), v(n) is the volume of the unit sphere in Rn, and X is the convex hull of the supports of pd and pg. Denote the optimal generator as p g(x). Consider the perturbation p g(x) + ϵ η(x), where η(x) is the n-dimensional counterpart of η(x) deﬁned in Appendix B.2. The ﬁrst variation LG is given by

X φ(y)η(x y)(p g(x) pd(x)) dy dx

2αd (λp + µp(x)) + a, x φ (pd p g) (x) η(x) dx

where αd = λ d 4 . The term T1 involves a convolution with a singular kernel φ(y), with the singularity at the origin. The integrals therefore have to be evaluated in the Cauchy principal-value sense. We make the interpretation explicit by deﬁning:

X ( ) dx = lim ξ 0

where X ξ = X B(0, ξ), which is formed by removing a ball of radius ξ centered at the origin. Recall that X is assumed to be compactly supported, and hence X ξ is compactly supported as well. Consider η to be absolutely integrable over X ξ. Applying Fubini s theorem to T1 yields

T1 = lim ξ 0

p g(x) pd(x) η(x y) dx dy,

X ξ φ(y) p g(x + y) pd(x + y) η(x) dx dy.

Swapping the order of integration yields

T1 = lim ξ 0

X ξ η(x) φ p g pd (x) dx,

Euler-Lagrange Analysis of GANs

since φ is radially symmetric. Substituting T1 back into LG, setting it to zero, and invoking the fundamental lemma of calculus of variations (cf. Section 2), we obtain the condition

φ (p g pd) (x) = αd(λp + µp(x)) + 1

2 a, x , (50)

which the optimal generator p g must satisfy. Applying the Laplacian operator on both sides of Equation (50) and noting that φ(x) = δ(x) from Theorem 3, we get

p g(x) = pd(x) + αd µp(x). (51)

An alternative approach to obtaining Equation (51) from (50) involves the use of the Fourier transform of distributions. The Fourier transform of a, x can be deﬁned in the distributional sense as follows:

| {z } δ i(ω)

where δ (ωi) denotes the derivative of the Dirac delta considered along ωi, which is the ith

element of ω = [ω1, ω2, . . . , ωn]T. Similarly, consider the n-dimensional radially symmetric function:

fτ(x) = 2 τ

where x Rn and Γ( ) denotes the gamma function. Gelfand and Shilov (1958) showed that the Fourier transform of fτ(x) is also radially symmetric, and has an expression given again in terms of fτ as follows:

F{fτ(x)} = (2π) n 2 f n τ(ω). (52)

Multiplying Equation (50) by 2 n 2

f2 n(x) (pd p g) (x) = 2 n 2

2 αd κn (λp + µp(x)) +

Taking the Fourier transform on both sides, we get

f 2(ω) ˆpd(ω) ˆp g(ω) =

(λpδ(ω) + ˆµp(ω)) +

j aiδ i(ω) , (53)

where ˆp g(ω), ˆpd(ω), and ˆµp(ω) are the n-dimensional Fourier transforms of p g(x), pd(x), and µp(x), respectively. Rearranging the terms, we get

ˆp g(ω) = ˆpd(ω)

(λpδ(ω) + ˆµp(ω)) ω 2

j aiδ i(ω) ω 2.

Asokan and Seelamantula

It can be veriﬁed that ω 2δ(ω) = 0, and ω 2δ i(ω) = 0. Consequently, the above equation simpliﬁes to

ˆp g(ω) = ˆpd(ω) +

( ω 2 2)ˆµp(ω). (54)

Invoking the derivative properties of the n-dimensional Fourier transform, we have

F ω2 i ˆµp(ω),

F ω 2 2 ˆµp(ω).

Taking the inverse transform on both sides of (54) gives the optimal generator

p g(x) = pd(x) +

2 αd (2π) n 2 κn

Recall that κn = 1 n(n 2)v(n), where v(n) = π n 2 Γ( n

2 +1), which is the volume of the unit hyper-

sphere in Rn. Substituting into the above yields

p g(x) = pd(x) + αd µp(x),

which is in agreement with the solution obtained in (51). The next step would be to determine the optimal KKT multipliers λp and µp(x). The analysis follows analogously to the 1-D case, by replacing the second derivative operator with the Laplacian operator. Consider splitting X into disjoint sets X+ = {x | pd(x) > 0} and X0 = {x | pd(x) = 0}. Enforcing the integral, non-negativity, and complementary slackness constraints on p g(x) yields the following choices for µ p(x):

µ p(x) = 0, x X, or µ p(x) =

( 0, x X+, and

< c0 0, x X0. (55)

Either choice of µ p(x) results in the generator loss LG evaluating to zero, and the optimal generator distribution matching the data distribution: p g(x) = pd(x). Further, the optimality of the generator is independent of the value of λp. Summarizing, the optima are:

p g(x) = pd(x); µ p(x) = 0, x X and λp ( , )

The preceding derivation considered the n-dimensional case with n 3. The n = 1 case was presented in Appendix B.2. The analysis for n = 2 is presented next. The analysis follows analogously to the n 3 case up until Equation (50). Thereafter, the diﬀerence lies in the Fourier transform of φ(x). We have

2π ln( x ), x Rn {0}, n = 2 κn x 2 n n 3

Euler-Lagrange Analysis of GANs

To determine the Fourier transform of φ(x) = 1

2π ln( x ), we must pay attention to the singularity at the origin. The Fourier transform must be deﬁned in the distributional sense.

Vladimirov (1984) showed that the Fourier transform of 1 x 2 , x Rn {0} is given in the

distributional sense as follows:

= 2π ln( ω ) 2πC0, ω Rn {0},

where in turn, J0(u) is the zeroth-order Bessel function of the ﬁrst kind (Abramowitz, 1974). From the duality property of the Fourier transform, we have

F {φ(x)} = F 1

2π ln( x ) = 1

2π 1 ω 2 + C0

From here on, the rest of the analysis corresponding to the optimal WGAN-GNP generator proceeds as in the n 3 case in particular, Equation (53) onward, where f 2(ω) is replaced with the above Fourier transform. Ultimately, the optimal solution, p g(x) = pd(x), remains unchanged.

C.3 Optimal Lagrange Multiplier in WGAN-FS (n-D)

Consider the Fourier-series (FS) discriminator D FS(x) in the multivariate case:

D FS(x) 1 λ 2 FS

a, x + constant + X

γr m cos(ωo m, x ) + γi m sin(ωo m, x ) !

Taking the derivative with respect to xℓand squaring, we get:

2 = 1 λ 2 FS

γr mωomℓsin(ωo m, x ) + γi mωomℓcos(ωo m, x ) !2 .

Using the Cauchy-Schwartz inequality:

we obtain the following bound:

m M ω2 om2 ℓ γr2 m sin2(ωo m, x ) + γi2 m cos2(ωo m, x ) !

Asokan and Seelamantula

where |M| is the cardinality of the set M of the selected harmonics. Summing over ℓyields:

D 2 2 2|M| + 1

m M ω2 o m 2 γr2 m sin2(ωo m, x ) + γi2 m cos2(ωo m, x ) !

D 2 2 2|M| + 1

τ i m + τ r m + τ i m τ r m cos(2ωo m, x ) !

where τ r m = 1

2(γr m)2ω2 o m 2, and τ i m = 1

2(γi m)2ω2 o m 2.

Enforcing the gradient-norm penalty: R

X D 2 2 1 dx = 0, gives

τ i m + τ r m + τ i m τ r m cos(2ωo m, x ) !

Simplifying the above gives the condition on the optimal Lagrange multiplier:

λ 2 FS (2|M| + 1)

τ i m + τ r m + τ i m τ r m cos(2ωo m, x ) !

= (2|M| + 1)

τ i m + τ r m + X

τ i m τ r m |X|

X cos(2ωo m, x )dx !

Given the data

D = {xk} = {xd, s.t. xd pd} [ {xg, s.t. xg pg}

of cardinality |D| = N, we can estimate the upper bound on λ FS as follows:

v u u t(2|M| + 1)

m M (τ im + τ rm) + 1

m M (τ im τ rm) cos(2ωo m, xk )

Recall that a = 1 for the optimal discriminator to satisfy the gradient-norm penalty ΩD when p g = pd (cf. Section 4). In practice, the contribution of a was found to be negligible in comparison with the other terms. The worst-case choice for the Lagrange multiplier is

v u u t(2|M| + 1)

m M (τ im + τ rm) + 1

m M (τ im τ rm) cos(2ωo m, xk )

Appendix D. Fourier-series Error Analysis

In this appendix, we analyze the various sources of error in approximating the inﬁnite Fourier series of the generator and data distributions, and the discriminator, and derive upper bounds for the mean-squared error when truncating the Fourier series, and when computing the Fourier coeﬃcients through sample estimates. Our analysis is inspired by the 1-D analysis reported in Giardina and Chirlian (1972). We also generalize the results to higher dimensions, which is pertinent to the present discussion.

Euler-Lagrange Analysis of GANs

D.1 Fourier-series Truncation Error (1-D)

Consider the inﬁnite and truncated Fourier series expansions of pd, pg and D (x), given as follows:

m Z αmejωomx, and pd(x) =

m= M αmejωomx,

m Z βmejωomx, and pg(x) =

m= M βmejωomx, and

m Z γmejωomx, and DFS(x) = 1

m= M γmejωomx.

Considering the Fourier expansion of pd, the mean-squared error in truncation is given by

ϵ2 pd = pd(x) pd(x) 2 2 = Z

(pd(x) pd(x))2 dx.

Applying Parseval s identity, we get

ϵ2 pd = (2T)

m=M+1 |αm|2.

A continuously diﬀerentiable and compactly supported function pd C1 c is of bounded variation over the support X, denoted by VX [pd] = R

X |p d(x)| dx Bd. Consider the modulus of the Fourier coeﬃcient αm:

2 pd(x)e jωomxdx

2 pd(x) d(e jωomx)

Integrating by parts, and noting that pd(x)e jωomx is T-periodic, we have:

|αm| = 1 ωom T

2 e jωomx d (pd(x))

1 ωom T VX [pd(x)] . (56)

This gives us the bound |αm| Bd 2πm. Substituting this result into the mean-squared error yields

ϵ2 pd B2 d πωo

Asokan and Seelamantula

Bounding the tail sum with an integral results in the following:

ϵ2 pd B2 d πωo

ϵ2 pd B2 d πωo

Similarly, the mean-squared error in the truncation of pg is bounded as follows:

where Bg is the bound on the variation of pg. These truncation errors result in the following error in the truncated Fourier series of the discriminator:

ϵ2 D = (2T)

m=M+1 |γm|2 = 2T

Using the inequality (a b)2 2(a2 + b2) gives

|αm|2 + |βm|2

From the result in Equation (56), we have

ϵ2 D 4(B2 d + B2 g) ω3o

Bounding the tail sum with an integral gives

ϵ2 D 4(B2 d + B2 g) ω3o

= 4(B2 d + B2 g) 3ω3o M3 ,

which is the desired bound on the truncation error of the Fourier-series expansion of the discriminator.

D.2 Fourier-series Truncation Error (n-D)

The inﬁnite and truncated complex Fourier-series expansions of D(x) are given by

m Zn γmejωo m,x , and DFS(x) = 1

m [M]n γmejωo m,x ,

Euler-Lagrange Analysis of GANs

respectively, where x Rn, m = [m1, m2, . . . , mn]T Zn +, [M]n denotes the Cartesian product space { M, M + 1, . . . , M 1, M}n and ωo is the fundamental frequency common to all the dimensions. The mean-squared error in truncation is given by

ϵ2 D = DFS(x) DFS(x) 2 2 = Z

DFS(x) DFS(x) 2 dx.

From Parseval s identity, we have:

ϵ2 D = (2T)n X

m2=M+1 . . .

mn=M+1 |γm|2

m Zn +\[M]n + |γm|2,

where [M]n + denotes the Cartesian product space {1, 2, . . . , M 1, M}n. Substituting for γm from Equation (22) gives

ϵ2 D = (2T)n

m Zn +\[M]n +

From Cauchy-Schwartz inequality, we have |αm βm|2 2 α2 m + β2 m . Substituting into the above equation, we have

m Zn +\[M]n +

|αm|2 + |βm|2

The right-hand side of the truncation error can be improved by invoking additional smoothness assumptions on pd and pg. Consider pd and pg to be in Cℓ(X) T W k,2(X); ℓ> k (cf. Assumption 2). The fact that the functions are in Cℓ(X) ensures that the derivatives of pd and pg are well-deﬁned in classical sense (as opposed to the weak derivatives, that are considered in the case of Sobolev spaces). Recall the deﬁnition of the Sobolev-k space W k,2

that subsumes Ck(X) functions:

f ; f W k,2 =

i=1 f(i) 2 2

n f ; f W k,2 = f 2 2 + f(k) 2 2 < o ,

where f(i) denotes the vector consisting of all ith partial derivatives of f, and the equivalence holds in the sense of the topology induced by the norms f W k,2 and f W k,2 (Sobolev, 1963). Furthermore, when considering the Fourier-series expansion of functions in W k,2, the following equivalences hold in terms of the Fourier coeﬃcients holds (Sobolev, 1963):

f L2(X) ; s.t. X

1 + m 2 + m 4 + . . . + m 2k |fm|2 <

Asokan and Seelamantula

where {fm} denote the Fourier coeﬃcients of f. From the bound in Equation (58), for the inﬁnite sum to converge, given data in Rn, we require the individual terms |fm|2 to decay at a rate greater than m (2k+n). Therefore, there exists a constant Mf < such that, for ﬁnite k, we have

|fm|2 Mf m (2k+n+1), m.

While a similar bound can be derived using the exponent (2k + n + τ), τ = 1, 2, . . ., we set τ = 1 to obtain a tight bound. This bound on |fm| is adequate for the inﬁnite sum in Equation (58) to be bounded, as shown in Appendix D.2.1. Substituting the above bound into Equation (57) yields

ϵ2 D (2T)n(Mpd + Mpg)

m Zn +\[M]n + m (2k+n+4).

The sum can be bounded by considering an appropriate integral as shown below:

m Zn\[M]n m (2k+n+4) Z

y2=M . . . Z

yn=M y (2k+n+4) dy1dy2 . . . dyn,

where yi is the ith entry in y. Converting to n-dimensional spherical coordinates and simplifying, we get

m Zn\[M]n m (2k+n+4) Sn 1

r=M n r (2k+5) dr,

where Sn 1 = π n 2 Γ n

2 1 is the hyper-surface area of the n-dimensional unit sphere. Evaluating the integral on the right-hand side yields the following bound:

m Zn\[M]n m (2k+n+4) Sn 1 (2k + 4) (M n) (2k+4).

Using this bound gives us the desired result of Theorem 5:

ϵ2 D (2T)n(Mpd + Mpg)Sn 1

2ω4o | {z } Cn,T

(M2n) (k+2)

(M2n) (k+2)

Extension to Inﬁnitely Diﬀerentiable Functions: The above analysis can be extended to the case when pd and pg belong to C (X), i.e., they are inﬁnitely diﬀerentiable. In this case, the deﬁnition in Equation (58) holds for all k, and we have:

|fm|2 M fe m 2,

where M fm is a constant dependent on f, that is, the Fourier coeﬃcients decay exponentially. Substituting the above in ϵ2 MS and bounding the sum by the n-dimensional integral in

Euler-Lagrange Analysis of GANs

hyperspherical coordinates gives

(2T)n(M pd + M pg)

r=M n e r2rn 1 dr

(2T)n(M pd + M pg)

s=M2n e ss n 2 1 ds,

where the equality results from the change of variable s = r2. The integral is the upper incomplete gamma function Γ n

2 , M2 n , This gives us the bound

(2T)n(M pd + M pg)

| {z } C n,T

Asymptotically, as M , the upper incomplete gamma function satisﬁes the property: Γ n

2 , M2 n (M2n) n 2 e M2n.

D.2.1 Bound on the Fourier Coefficients

Consider the Sobolev-k space W k,2. For functions drawn from W k,2(X), the coeﬃcients of the Fourier-series expansion {fm} must satisfy (Sobolev, 1963):

1 + m 2 + m 4 + . . . + m 2k |fm|2

= |f0|2 + X

1 + m 2 + m 4 + . . . + m 2k |fm|2 < . (59)

Consider the bound on the Fourier coeﬃcients for ﬁnite k, given by:

|fm|2 Mf m (2k+n+1), m.

Since we are working with Fourier-series representations of p.d.f.s, we have f0 = 1. Therefore

Sk 1 + Mf X

1 + m 2 + m 4 + . . . + m 2k m (2k+n+1)

m (2k+n+1) + m ((2k 2)+n+1) + . . . + m (n+1)

1 + (k + 1) Mf X

m Zn\{0} m (n+1),

where the second inequality is obtained by approximating each term in the summation by the largest one. Further bounding the inﬁnite sum by an appropriate integral as follows:

Sk 1 + 2(k + 1) Mf

y1=1 . . . Z

yn=1 y (n+1) dy1 . . . dyn,

Asokan and Seelamantula

and converting to n-dimensional spherical coordinates, we get

Sk 1 + (k + 1) Mf Sn 1

r= n r 2 dr,

where Sn 1 is the hyper-surface area of the n-D unit sphere. Evaluating the integral yields

Sk 1 + (k + 1) Mf Sn 1n 1

which is ﬁnite.

D.3 Error in the Sample Estimation of Fourier Coeﬃcients

In Appendix D.2, we analyzed the eﬀect of truncating the Fourier series with a rectangular sum of M terms along each dimension. In practice, in addition to truncating the Fourier series, there is also error arising out of approximating the Fourier coeﬃcients of pd and pg. Consider the trigonometric Fourier expansion of pd:

m Zn + αr m cos(ωo m, x ) + αi m sin(ωo m, x )

and its Mth-order approximation:

m [M]n + αr m cos(ωo m, x ) + αi m sin(ωo m, x ),

where x Rn, Zn + denotes the set of positive non-zero integers, [M]n + represents the product space {1, 2, . . . , M}n, m = [m1, m2, . . . , mn] Zn +, and αr m and αr m are the true Fourier coeﬃcient and its N-sample estimate given by

X cos(ωo m, x )pd(x) dx = Ex pd [cos(ωo m, x )] , and

cos(ωo m, xk ),

respectively. Considering independent and identically distributed (i.i.d.) samples {xi}, it can be shown that

Ex [ αr m] = 1

Ex [cos(ωo m, xk )] = αr m

Similarly, the variance of the estimate is given by

Var ( αr m) = Ex h ( αr m)2i (Ex [ αr m])2 = Ex h ( αr m)2i (αr m)2 . (60)

Euler-Lagrange Analysis of GANs

Expanding the ﬁrst term on the right-hand side yields

Ex h ( αr m)2i = 1 N2 Ex

i=1 cos (ωo m, xi )

i=1 Ex cos2 (ωo m, xi )

Ex [cos (ωo m, xi ) cos (ωo m, xj )] .

For i.i.d. samples {xi}, we have

Ex h ( αr m)2i = 1 N2

i=1 Ex cos2 (ωo m, xi )

Ex [cos (ωo m, xi )] Ex [cos (ωo m, xj )]

i=1 Ex cos2 (ωo m, xi ) + N2 N

Applying the half-angle trigonometric formulae gives

Ex h ( αr m)2i = 1 N2

2 cos (2ωo m, xi ) + N2 N

2N Ex [cos (2ωo m, xi )] + N2 N

Ex h ( αr m)2i = 1 2N + 1

2N αr 2m + N2 N

Substituting the above into Equation (60) yields:

Var ( αr m) = 1

2αr 2m (αr m)2 . (61)

A similar analysis for αi m gives

Ex αi m = αi m, and Var αi m = 1

2αr 2m (αr m)2 . (62)

We ﬁrst bound the mean-squared error between the target expansion pd(x) and the approximation pd(x) in general, and subsequently, specialize the result for inﬁnitely diﬀerentiable functions. Consider the error

ϵ2 pd = pd(x) pd(x) 2 2 = Z

X (pd(x) pd(x))2 dx.

Asokan and Seelamantula

From Parseval identity, we have:

ϵ2 pd = | α0 α0|2

| αr m αr m|2 + | αi m αi m|2 + X

m Zn +\[M]n +

|αr m|2 + |αi m|2 .

The analysis for m = 0 can be accounted for in αr m, and we have Ex[ αr 0] = αr 0 = 1 with Var( αr 0) = 0. Taking the expected value of ϵ2 MS with respect to x, we get

Ex ϵ2 pd = X

m [M]n + Ex h ( αr m αr m)2i + Ex h αi m αi m 2i + X

Zn +\[M]n + |αr m|2 + |αi m|2

m [M]n + Var( αr m) + Var( αi m)

Zn +\[M]n + |αr m|2 + |αi m|2

where T1 the statistical component of the error, caused by the error in approximating the Fourier coeﬃcient by replacing expectations with their samples estimates, and T2 is deterministic, given the choice of the truncation frequency M. Substituting for the variance terms and simplifying, we get:

Ex ϵ2 pd = 1

1 |αr m|2 |αi m|2 + X

Zn +\[M]n +

|αr m|2 + |αi m|2 ,

m [M]n + |αm|2 + X

Zn +\[M]n + |αm|2, (63)

where αm = αr m + j αi m are the coeﬃcients of the exponential Fourier series. Akin to the analysis in Appendix D.2, consider mpd m (2k+n+1) |αm|2 Mf m (2k+n+1), where mpd < Mpd. Employing these bounds yields

Ex ϵ2 MS Mn

m [M]n + m (2k+n+1)

m Zn +\[M]n + m (2k+n+1)

We can bound the elements in sum S1 considering m = [1, 1, . . . , 1]T. S2 can be bounded by the integral in hyperspherical coordinates similar to the procedure employed in Appendix D.2:

m Zn +\[M]n + m (2k+n+1)

y2=M . . . Z

yn=M y (2k+n+1) dy1dy2 . . . dyn

r=M n r (2k+2) dr,

where Sn 1 is the surface area of the n-D unit hypersphere. For ﬁnite k, we have

Sn 1 (2k + 1) n(k+ 1

| {z } C n,k

1 M2k+1 = C n,k 1 M2k+1 .

Euler-Lagrange Analysis of GANs

Substituting for S1 and S2 gives

Ex ϵ2 pd Mn

1 mpd nk+ n+1

| {z } εstat

+ Mpd C n,k 1 M2k+1 | {z } εtrunc

where εstat and εtrunc denote the statistical and deterministic contributions to the error, respectively. Observe that as the dimensionality of data increases, the batch size must increase at a rate of N Mn+1 for the approximation error εstat to decay. For a given N, increasing M results in poorer estimates of the Fourier coeﬃcients. One requires more samples to estimate the high-frequency components accurately, failing which, undesirable oscillations will appear in the representation. Experimental illustrations of this oscillation phenomenon will be presented in Appendix E.1. The contribution of εstat associated with the sample estimation of αm is larger than the truncation error εtrunc for most M. This results in a trade-oﬀbetween discarding high-frequency components versus poorly estimating them due to insuﬃcient samples. The relative eﬀect of εstat and εtrunc indicates that it is indeed better to discard the high-frequency terms in these scenarios. We restrict our Fourier-series expansions in all experimentation to include up to Mlow = 2 harmonics along all dimensions. For data in n-D, this results in an exponential blow-up of terms at the rate of n Mlow for larger M. For instance, with Mlow = 3 for 64-D data, there would be 643 2 105 terms in the Fourier expansion. To improve the representation of high-frequency components, we also uniformly randomly sample L1 = O(103) harmonics between Mlow = 2 and Mhigh = 10.

Extension to Inﬁnitely Diﬀerentiable Functions: We now extend the result in Equation (64) to the case when pd and pg are inﬁnitely diﬀerentiable. For C (X) functions, each term in the Fourier series is approximately O(e m 2). Similar to case when k is ﬁnite, there exist two constants m pd < M pd such that:

m [M]n + e m 2

m Zn +\[M]n + e m 2

Each term in S1 can be bound by the largest value, as in the Ck c case. The sum in S2 can be bounded by the hyperspherical integral as shown in Appendix D.2, which yields

m Zn +\[M]n + e m 2 Z

y2=M . . . Z

yn=M e y 2 dy1dy2 . . . dyn

r=M n e r2rn 1dr,

where, as before, Sn 1 denotes the surface area of a unit hypersphere in Rn. As in Appendix D.2, the above integral represents the upper incomplete Gamma function, which gives the bound:

Asokan and Seelamantula

Substituting back for S1 and S2 gives:

Ex ϵ2 pd Mn

N 1 m pde n

| {z } εstat

+ M pd Sn 1Γ n

| {z } εtrunc

where εstat and εtrunc are the statistical and deterministic components of the approximation error, as discussed for the Ck(X) case, which is the desired bound on the approximation for C (X) functions. This gives us a bound on the error when approximating the Fourier series of truncated Gaussian distributions, such as in the case of latent-space matching in Wasserstein autoencoders.

Appendix E. Additional Experimentation

In this appendix, we present additional experiments and results on univariate and multivariate synthetic Gaussian data, and on learning the image-space distributions with WGAN-FS. We also provide additional details on the evaluation metrics used.

E.1 Additional Experiments on 1-D and 2-D Gaussians

To begin with, we present results on learning 1-D and 2-D Gaussians and Gaussian mixtures with the WGAN-FS algorithm. Accuracy of the Fourier-series approximation: The experimental setup is as described in Section 3.7. The fundamental period T is set to 7 in all the experiments. In Figure 18, we present the target distribution pd and its Fourier-series approximation for various choices of truncation order M and batch size N to illustrate the trade-oﬀbetween truncating the Fourier series at low frequencies, and the error in approximating high-frequency coeﬃcients with sparse samples. We observe that, when M is small (e.g., M = 5), introducing additional samples does not improve the quality of the approximation. This is a manifestation of the truncation error (εtrunc) seen in Equation (64). For larger M, (e.g., M 25), we observe that, in line with the theory, the high-frequency terms have a larger variance in their estimate and require larger N to be estimated accurately. This is the statistical component of the error, (εstat), which can be reduced by increasing N. As inferred from Equation (64), the artifacts can be suppressed from the approximation by setting N > Mn+1 (for example, with N = 500 for M = 10 and N = 1000 for M = 25). We observe similar performance trade-oﬀs in the case of learning a bimodal Gaussian mixture in 1-D, as shown in Figure 19. Additionally, when N and M are both small, the Fourier-series approximation fails to capture the smaller mode. Based on these observations, we expect WGAN-FS to perform relatively better with lower M even in the high-dimensional setting. Choosing the fundamental period T : We next present results on varying the assumed period T, given truncation order M and batch size N. Based on the previous experiments, we set M = 10 and N = 100. We consider the 1-D Gaussian learning scenario as described in Section 3.7. The target is a Gaussian N(5, 1), while the noise distribution is N(0, 1). We compare results for various choices of the time period T {2, 5, 7, 11, 25, 75}. Figure 20 compares the quality of the Fourier-series approximation of the target distribution for each value of T. Since a Gaussian is inﬁnitely supported, there will be aliasing in the Fourier representation no matter what the choice of the period is. In order to capture maximum area

Euler-Lagrange Analysis of GANs

under the curve, to keep the aliasing error small, and to prevent the generator from latching on to an aliased version of the target density, we choose T to encompass 12σ supports of both the generator and the target densities in the fundamental period (for example, T 6 for the standard normal distribution). A good choice of the fundamental period T is one that is centered around the generator distribution, but also encompasses the target distribution. For the scenario where the standard normal N(0, 1) is chosen as the noise distribution when learning a target N(µ, σ) we observe that T max{6, µ + 6σ} results in a superior quality of the Fourier-series approximation of the target. Figures 21(a) and (b) plot the Wasserstein-2 distance W2,2 and generator loss LG, respectively, as a function of iterations for various T. We observe that, for small T, the generator latches on to an aliased version of the target, resulting in a large value for W2,2, although the loss LG converges to zero. Choosing a large value of T makes the distribution appear like a spike (high-frequency) in the fundamental period and therefore, an accurate representation requires a larger value of M. For large M, although the Fourier-series approximation is not accurate, the generator samples converge to the desired target samples in terms of W2,2 and LG by virtue of uniqueness of the Fourier representation for a given set of samples. Figure 21(c) shows the learnt discriminator for various choices of T. For small T, the learnt discriminator is unable to classify the target and generator distributions accurately. By virtue of the truncated Fourier-series approximation, the discriminator always learns a smooth approximation of the target classiﬁer. Convergence of the optimal Lagrange multiplier: We next illustrate the suitability of the optimal Lagrange multiplier λ FS to serve as a proxy to measure convergence of the GAN generator during training. Figure 22 shows λ FS and the Wasserstein-2 distance (W2,2) between pd and pg as a function of iterations. We observe that, for higher learning rates (lr 10 1), λ FS does not converge to zero, which may be attributed to the fact that the W2,2 metric measures the convergence only between the ﬁrstand second-order statistics, while λ FS measures the coeﬃcient-wise convergence between the Fourier-series of pd and pg, which indirectly measures the L2 error between the generator and target densities. This suggests that, while the models converge in the Wasserstein-2 sense for higher learning rates, convergence in the L2 sense occurs for lower rates (here, lr 10 2). Based on these results, we set the learning rate to 10 3 for the generator in the subsequent experiments.

Asokan and Seelamantula

Figure 18: (

Color online) Comparison of the quality of the Fourier-series approximation of a Gaussian pd(x) for various batch sizes N and truncation frequencies M.

Euler-Lagrange Analysis of GANs

Figure 19: (

Color online) Comparison of the quality of the Fourier-series approximation of a bimodal Gaussian pd for various batch sizes N and truncation frequencies M.

Asokan and Seelamantula

Figure 20: (

Color online) Comparison of the quality of the 10-component Fourier-series approximation of a Gaussian pd(x) for various choices of the fundamental period T. Underestimating the time period results in aliasing, while overestimating it results in worse approximations of the distribution and requires additional high-frequency components in the expansion to improve upon the quality.

Euler-Lagrange Analysis of GANs

100 200 300 ITERATIONS

W2,2(pd, pg)

100 200 300 ITERATIONS

5 0 5 10 15 x

DISCRIMINATOR D(x)

TARGET NOISE

IDEAL CLASSIFIER T = 2

T = 5 T = 7

T = 11 T = 25

Figure 21: (

Color online) Experiments on 1-D Gaussian data: Comparison of (a) Wasserstein-2 distance W2,2(pd, pg); and (b) Generator loss LG as a function of iterations when training WGAN-FS for various choices of T. For small T, the generator latches on to periodic replicas of the target, resulting in higher W2,2

values but low LG. (c) Comparison of the learnt discriminator when training WGAN-FS for various choices of T. WGAN-FS learns a smooth approximation of the true classiﬁer for all T that contain 12σ windows of the generator and target distribution, thereby avoiding aliasing.

Asokan and Seelamantula

lr = 10 1 lr = 5 10 2

250 500 750 1000 1250 1500 ITERATIONS

W2,2(pd, pg)

LAGRANGIAN MULTIPLIER: λ F S

250 500 750 1000 1250 1500 ITERATIONS

W2,2(pd, pg)

LAGRANGIAN MULTIPLIER: λ F S

lr = 10 2 lr = 10 3

250 500 750 1000 1250 1500 ITERATIONS

W2,2(pd, pg)

LAGRANGIAN MULTIPLIER: λ F S

250 500 750 1000 1250 1500 ITERATIONS

W2,2(pd, pg)

LAGRANGIAN MULTIPLIER: λ F S

lr = 10 4 lr = 10 5

2000 4000 6000 8000 ITERATIONS

W2,2(pd, pg)

LAGRANGIAN MULTIPLIER: λ F S

20000 40000 60000 80000 ITERATIONS

W2,2(pd, pg)

LAGRANGIAN MULTIPLIER: λ F S

Figure 22: (

Color online) Convergence of the optimal Lagrange multiplier λ FS alongside Wasserstein-2 distance between pd and pg (W2,2(pd, pg)) for various learning rates. For higher learning rates, while the model appears to converge in the sense of W2,2(pd, pg), which is a measure only up to second-order statistics, we observe from λ FS that the distributions converge in the L2 sense (the Fourier representation of pg converging to that of pd) only for learning rates lower than 10 2. For very low rates (such as 10 5), the convergence is not smooth. Therefore, we use learning rates in the range [10 2, 10 4] in the subsequent experiments.

Euler-Lagrange Analysis of GANs

E.2 Experiments on n-dimensional Gaussians

We now present experimental results on learning multivariate Gaussian data with truncated Fourier-series expansions for WGAN-FS based on the sampling scheme described in Section 5.1.

Experimental Setup: The experiments are conducted on n-D Gaussian data drawn from N(0.751n, 0.2In), where 1n denotes an n-dimensional vector with all entries equal to 1, and In is the n-dimensional identity matrix. The input to the generator is 100-D Gaussian noise. To simulate the scenario of training on real-world images with the WAE Encoder (Tolstikhin et al., 2018), the noise input is provided to a fully connected layer with 32 32 3 nodes, whose output is reshaped to (32, 32, 3). Subsequently, the reshaped noise vectors are provided as input to a network consisting of four convolution layers with 1024, 256, 128, and 64 ﬁlters in successive layers. The output of the convolution layers is ﬂattened and provided to a fully connected layer with n output nodes. The learning rate is set to 10 2, and batch size to N = 100. Recall that the Fourier-series expansion consists of two levels of approximation, one for the low-frequency part and the other for the high-frequency part. We consider all harmonics up to Mlow, and a set of L distinct uniformly drawn/sampled harmonics between Mlow and Mhigh. We pick 10 n 256 to represent diﬀerent latent space dimensions used in standard autoencoder architectures for images (Tolstikhin et al., 2018).

Results: Figure 23 shows the Wasserstein-2 metric W2,2, generator loss LG and Lagrange multiplier λ FS as a function of iterations, when training WGAN-FS to learn 10-D Gaussian data. We set Mlow = 2 and Mhigh = 10. We experiment on multiple choices of the sample size: L {5, 10, 20, 100, 500, 1000, 10000, 25000}. We observe from Figure 23(a) that the model converges faster for smaller L (for example L 500 in the experiments). However, as seen in Figure 23(b), for small L, the value of LG is higher. From Figure 23(c), we see that for large L (such as L > 103), the convergence of the model in terms of λ FS is slower. We attribute this to the slower convergence of the high-frequency components in the Fourier-series expansions due to increased variance in estimating these components for a given batch size N. This disparity is more pronounced when λ FS is plotted on the logarithmic scale, as seen in Figure 23(b). We therefore chose 102 L 104 to be a good compromise between achieving lower values of the generator loss and faster convergence of the model. The ﬁndings were similar when training the WGAN-FS model on 64-D and 128-D Gaussians (cf. Figures 24 and 25, respectively). In order to motivate the need for latent space matching, we compare the performance of WGAN-FS for various n, given the sampling parameters Mlow = 2, Mhigh = 10 and L = 1000. From Figure 26, we observe that, as n increases, both W2,2 and λ FS exhibit poorer convergence (saturation to higher values). There is also increased jitter in the convergence of the loss and λ FS as n increases. From these results, it is evident that the WGAN-FS discriminator exhibits superior performance on low-dimensional distribution matching, such as in the case of latent-space matching in adversarial or Wasserstein autoencoders (Makhzani et al., 2015; Tolstikhin et al., 2018).

Asokan and Seelamantula

200 400 600 ITERATIONS

W2,2(pd, pg)

500 1000 1500 ITERATIONS

500 1000 1500 ITERATIONS

500 1000 1500 ITERATIONS

L = 5 L = 10

L = 20 L = 100

L = 500 L = 103

L = 2.5 104

Figure 23: (

Color online) Experiments on 10-D Gaussian data: Plots comparing the convergence of: (a) Wasserstein-2 distance W2,2; (b) Generator loss LG; (c) Optimal Lagrange multiplier λ FS, and (d) the natural logarithm of λ FS as a function of iterations when training WGAN-FS with L randomly sampled high-frequency components. The convergence is slower for large L as the error in estimating the coeﬃcients increases with an increase in the number of high frequency terms.

Euler-Lagrange Analysis of GANs

200 400 600 ITERATIONS

W2,2(pd, pg)

200 400 600 800 ITERATIONS

200 400 600 800 ITERATIONS

200 400 600 800 ITERATIONS

L = 10 L = 20 L = 50 L = 100 L = 103 L = 104

Figure 24: (

Color online) Experiments on 64-D Gaussian data: Plots comparing the convergence of: (a) Wasserstein-2 distance W2,2; (b) Generator loss LG; (c) Optimal Lagrange multiplier λ FS, and (d) the natural logarithm of λ FS when training WGAN-FS on 64-dimensional Gaussian data for various number of sampled high-frequency coeﬃcients, L. We observe that λ FS converges to a worse (higher) value for larger L, while Wasserstein-2 distance W2,2(pd, pg) and generator loss LG are worse for small L. Setting L to be around 103 is a viable compromise.

Asokan and Seelamantula

200 400 600 ITERATIONS

W2,2(pd, pg)

500 1000 ITERATIONS

250 500 750 1000 ITERATIONS

250 500 750 1000 ITERATIONS

L=10 L=20 L=50 L=100 L=1k L=10k

Figure 25: (

Color online) Experiments on 128-D Gaussian data: Plots comparing the convergence of: (a) Wasserstein-2 distance W2,2; (b) Generator loss LG; (c) Optimal Lagrange multiplier λ FS, and (d) the natural logarithm of λ FS when training WGAN-FS on 128-dimensional Gaussian data for various number of sampled high-frequency coeﬃcients, L. We observe that the models converge to worse (higher) values of λ FS as L increases. This suggests that Fourier-seriesbased discriminator performs better when fewer high-frequency components are included in the approximation.

Euler-Lagrange Analysis of GANs

250 500 750 1000 ITERATIONS

W2,2(pd, pg)

250 500 750 1000 ITERATIONS

250 500 750 1000 ITERATIONS

200 400 600 800 ITERATIONS

n = 10 n = 16

n = 32 n = 48

n = 64 n = 128

n = 192 n = 256

Figure 26: (

Color online) Plots comparing the convergence of (a) Wasserstein-2 distance W2,2(pd, pg), (b) Generator loss LG, (c) the optimal Lagrange multiplier λ FS, and (d) the natural logarithm of λ FS when training WGAN-FS on n-dimensional data, for various n. Across all three metrics, we observe that the models converge to worse (higher) values as the dimensionality of the data increases. This suggests that Fourier-series-based discriminator performs better on lower-dimensional latent-space matching.

Asokan and Seelamantula

MNIST MNIST (only Digit 4)

(a) (b) Figure 27: Images generated by training WGAN-FS on 784-dimensional data consisting of vectorized images drawn from (a) the entire MNIST dataset; and (b) only the Digit 4 class of MNIST. WGAN-FS fails to converge when trained on the complex MNIST multimodal data due to errors in estimating the full distribution with a truncated Fourier series and a small batch size. While the performance is better in the case of single-class learning, the images are sub-par compared to the baseline GANs.

E.3 Image-space Matching with WGAN-FS

In an n-dimensional setting, the Fourier-series approximation, and thereby, the WGAN-FS approach require that the underlying distributions are at least n

2 -times continuously differentiable for the truncation error derived in Appendix D.2 to be ﬁnite. However, it is widely accepted that image datasets lie in unions of low-dimensional manifolds in a higher dimensional space (Lui et al., 2017) resulting in a multimodal pd. This can lead to poorer estimates of the Fourier-series coeﬃcients of pd and pg, and the WGAN-FS generator will not learn the data distribution accurately. To validate this, we trained the WGAN-FS model on vectorized MNIST data, with x R784. The generator consists of a 3-layer fully connected network with 128, 256, and 512 nodes, in successive layers with the hyperbolic tan activation. The output layer consists of 784 nodes with a sigmoid activation and the input to the network is a 100-dimensional Gaussian vector. The fundamental period of the Fourier series is set to 2. The sampling scheme described in Appendix E.2 is used, with Mlow = 2, Mhigh = 10 and L = 1000. The model is trained with the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 10 3 and a batch size of 250.

Figure 27(a) shows the images generated by WGAN-FS when trained with the entire MNIST dataset. We observe that the model has failed to learn the data distribution accurately. Instead, it learns only the average statistics (such as the mean digit) of the dataset. Figure 27(b) shows that the model performs better on a single-class learning task, where the data distribution is more structured. However, the visual quality of the generated images is below par than those generated by the baseline GAN variants (Arjovsky et al., 2017; Gulrajani et al., 2017; Terj ek, 2020). We attribute this poor performance to inaccuracies in estimating the Fourier-series coeﬃcients from a small batch size and insuﬃcient harmonics, given that the ambient dimension of the data is 784. We also note that the WGAN-FS model did not converge when trained on 3072-dimensional CIFAR-10 or SVHN images or other high-resolution datasets. These results further motivate the need for training the Fourier-series model on latent space representations of the data.

Euler-Lagrange Analysis of GANs

E.4 Additional Details on Evaluation Metrics

In this appendix, we provide additional details on the evaluation metrics used in Sections 3.7, 4.3, and 6 of the main manuscript:

Wasserstein-2 distance: For Gaussian generator and target distributions, the Wasserstein-2 distance has a closed-form expression:

W2,2 (N(µd, Σd), N(µg, Σg)) = µd µg 2 2 + Trace

Σd + Σg 2 Σ

Fr echet Inception Distance (FID): The FID was introduced by Heusel et al. (2017) to measure the visual quality of images generated by GANs. FID is highly correlated with human evaluation of such images. In this setting, we ﬁrst consider the Inception V3 model (Szegedy et al., 2015) without the topmost layer, loaded with pre-trained weights for Image Net (Deng et al., 2009) classiﬁcation, to generate embeddings of the real and generated data. Next, we assume that the embeddings of the real and generated data are distributed as N(µd, Σd) and N(µg, Σg), respectively, and compute FID as the Fr echet distance between them:

Fr (N(µd, Σd), N(µg, Σg)) = µd µg 2 2 + Trace Σd + Σg 2 (ΣdΣg) 1 2 .

The Inception V3 model accepts input dimensions in the range of 76 76 3 to 229 229 3. For consistency with the literature, color images are upscaled to 229 229 3 using bilinear interpolation. Grayscale images are upscaled to 229 229 and replicated across the color channels. The FID score is measured over batches of 104

images. The best-case FID scores of the converged models are measured using 5 104

samples drawn from both the target dataset and the WAE in all cases except Ukiyo-E and single-class CIFAR-10, where the entire target class (about 5 103 images) is used. We use the publicly available Tensor Flow implementation of clean-ﬁd (Parmar et al., 2021) to compute the metric. Our implementation of CWAE (Knop et al., 2020) and WAE (Tolstikhin et al., 2018) produced FID scores that are consistent with the literature on CIFAR-10 and Celeb A datasets. FID scores for MNIST, SVHN and Ukiyo-E datasets were not reported in the baselines.

Image Sharpness: We employ the metric proposed by Tolstikhin et al. (2018) to characterize image sharpness. The sharpness metric is measured on two sets of data: (i) Images obtained by decoding sample vectors drawn from the prior distribution; and (ii) Images obtained by decoding interpolated latent vectors. The test images are rescaled to have pixel intensities in [0, 1], and convolved with the Laplacian kernel

1 1 1 1 8 1 1 1 1

to emphasize edges. The variance of the pixel intensities in the

Laplacian of the image is evaluated and averaged over a batch of 103 images to measure sharpness. Blurred images possess fewer distinct edges thereby resulting in a lower variance than sharper images.

Asokan and Seelamantula

Appendix F. Other Gradient-Regularized GANs

In this appendix, we consider Wasserstein GAN with the gradient penalty (WGAN-GP) (Gulrajani et al., 2017) and the centered WGAN-Rd and WGAN-Rg gradient penalties (Mescheder et al., 2018) within the Euler-Lagrange framework. We also consider the SGAN and LSGAN subject to the considered gradient-norm penalty, resulting in SGAN-GNP and LSGAN-GNP, respectively.

WGAN-GP: Consider the WGAN-GP discriminator loss given by

LD = Ex pd[D(x)] + Ex pg[D(x)] + λ Ex αpg+(1 α)pd ( D(x) 2 1)2

D(x) (pg(x) pd(x)) + (pg(x) + (1 α)pd(x)) ( D(x) 2 1)2 dx.

Applying the EL condition for optima yields the following condition that the optimal discriminator D (x) must satisfy:

(αpg(x) + (1 α)pd(x)) 1 D(x) 1 D(x) + pd(x) pg(x)

+ (α pg(x), D(x) + (1 α) pd(x), D(x) ) 1 D(x) 1

+ (αpg(x) + (1 α)pd(x)) D(x) 3 ( D(x)).2, diag(HD) D=D = 0,

D = [D 1, D 2, . . . , D n]T, with D i = D

is the gradient vector, ( D).2 represents element-wise squaring, and

diag(HD) = [D 11, D 22, . . . , D nn]T, with D ii = 2D

is the vector formed by the diagonals of the Hessian matrix of D. The result is an intractable, non-linear, second-order diﬀerential equation.

WGAN-Rd Rg: The WGAN-Rd Rg loss presented by Mescheder et al. (2018) is:

LD = Ex pd[D(x)] + Ex pg[D(x)] + λ1

2 Ex pd D(x) 2 2 + λ2

2 Ex pg D(x) 2 2

D(x) (pg(x) pd(x)) + 1

2 (λ1pd(x) + λ2pg(x)) D(x) 2 2

where the EL condition for optimality results in the following diﬀerential equation:

D + λ1 pd + λ2 pg, D

λ1pd + λ2pg = pg pd λ1pd + λ2pg .

Unlike WGAN-GP, this formulation results in a second-order elliptic diﬀerential equation with a variable coeﬃcient. This is an instance of the Stein operator considered by Mroueh

Euler-Lagrange Analysis of GANs

et al. (2018), where the gradient penalty is evaluated with respect to the measure λ1pd+λ2pg. Fourier transform based techniques could be employed to solve the PDE.

SGAN-GNP: Based on a ﬁnding that the gradient penalty improves the performance of WGANs (Gulrajani et al., 2017), the same penalty was used to improve the performance of several f-GAN variants by Roth et al. (2017). The expectations are evaluated with respect to an interpolated distribution between pd(x) and pg(x) by Roth et al. (2017) and between pd(x) and the standard Gaussian by Kodali et al. (2017). We consider the application of the gradient-norm penalty introduced in the context of WGANs in this paper to SGAN and LSGAN. Consider the optimization of the SGAN discriminator in 1-D with the gradient-norm penalty:

min D(x) LSGAN D s.t. Z

|D (x)|2 1 dx = 0.

The integrand in the Lagrangian of the constrained discriminator loss is

F x, D, D = ln (D(x)) pd(x) + ln (1 D(x)) pg(x) + λd|D (x)|2.

Following the Euler-Lagrange condition, the optimal discriminator must satisfy the following diﬀerential equation:

D (x) = pd(x) (pd(x) + pg(x)) D(x)

2λd D(x)(1 D(x)) . (65)

A similar solution can be obtained in the multidimensional case as well, where the SGAN optimization problem becomes

min D(x) LSGAN D s.t. Z

D(x) 2 2 1 dx = 0.

The Euler-Lagrange condition gives the second-order non-linear PDE:

D(x) = pd(x) (pd(x) + pg(x)) D(x)

2λd D(x)(1 D(x)) . (66)

Equations (65) and (66) are not amenable to a closed-form solution. A practical alternative would be to use a numerical solver.

LSGAN-GNP: Finally, consider LSGAN with the gradient-norm penalty in 1-D:

min D(x) LLSGAN D s.t. Z

|D (x)|2 1 dx = 0.

The integrand in the Lagrangian of the constrained discriminator loss is

F x, D, D = (D(x) b)2pd + (D(x) a)2pg + λd|D (x)|2.

Asokan and Seelamantula

Applying the Euler-Lagrange condition gives

D (x) | {z } T1

+ pg(x) + pd(x)

D(x) | {z } T2

= apg(x) + bpd(x)

λd | {z } T3

The Fourier-series expansions of pg(x), pd(x), and D(x) as deﬁned in Section 3.4 simplify T1 and T3 readily, whereas T2 could be simpliﬁed using the convolution property:

n= χnejωonx, where χn =

ℓ= γℓ(αn ℓ+ βn ℓ).

The sequence {χn} is a sum of two convolutions, one between the sequences {αℓ} and {γℓ}, and the other between {βℓ} and {γℓ}. The sequence {γℓ} has to be determined in order to arrive at the discriminator. Simplifying (67) in view of the Fourier-series representations gives the following optimality conditions in terms of the Fourier coeﬃcients:

λdω2 on2γn aαn + bβn +

ℓ= γℓ(αn ℓ+ βn ℓ) = 0, n Z {0}, (68)

which is an inﬁnite system of linear equations. One approach may be to consider a truncated Fourier-series expansion, which would give rise to a ﬁnite system of linear equations that can be solved using iterative algorithms or the Moore-Penrose pseudo-inverse. Let us now consider the discriminator loss in n dimensions:

min D(x) LLSGAN D s.t. Z

D(x) 2 2 1 dx = 0.

The EL condition applied to the Lagrangian of the discriminator loss results in the following:

D(x) + pd(x) + pg(x)

D(x) = apg(x) + bpd(x)

The above PDE is of the form D(x) + β1(x)D(x) = β0(x). As in the 1-D case, one could consider Fourier-series expansions and take advantage of the multidimensional convolution property. However, the computational complexity would increase exponentially with n.

M. Abadi et al. Tensor Flow: Large-scale machine learning on heterogeneous distributed systems. ar Xiv preprint, ar Xiv:1603.04467, Mar. 2016. URL https://arxiv.org/abs/ 1603.04467.

M. Abramowitz. Handbook of Mathematical Functions, with Formulas, Graphs, and Mathematical Tables. Dover Publications, Inc., 1974.

J. Adler and S. Lunz. Banach Wasserstein GAN. In Advances in Neural Information Processing Systems 31, pages 6754 6763. 2018.

Euler-Lagrange Analysis of GANs

D. An, Y. Guo, N. Lei, Z. Luo, S.-T. Yau, and X. Gu. AE-OT: A new generative model based on extended semi-discrete optimal transport. In Proceedings of the 8th International Conference on Learning Representations, 2020.

M. Arbel, D. Sutherland, M. Bi nkowski, and A. Gretton. On gradient regularizers for MMD GANs. In Advances in Neural Information Processing Systems 31, pages 6700 6710, 2018.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.

H. Bateman and A. Erdelyi. Higher Transcendental Functions. Mc Graw-Hill, New York, 1953.

D. Berthelot, C. Raﬀel, A. Roy, and I. Goodfellow. Understanding and improving interpolation in autoencoders via an adversarial regularizer. In Proceedings of the 7th International Conference on Learning Representations, 2019.

O. Bousquet, S. Gelly, I. Tolstikhin, C. J. Simon-Gabriel, and B. Schoelkopf. From optimal transport to generative modeling: the VEGAN cookbook. ar Xiv preprint, ar Xiv:1705.07642, May 2017. URL https://arxiv.org/abs/1705.07642.

T. Che et al. Your GAN is secretly an energy-based model and you should use discriminator driven latent sampling. ar Xiv preprint, ar Xiv:2003.06060, June 2020. URL https: //arxiv.org/abs/2003.06060.

X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Info GAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems 29, pages 2180 2188, 2016.

R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeﬀrey, and D. E. Knuth. On the Lambert-W function. In Advances in Computational Mathematics, pages 329 359, 1996.

C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In Proceedings of the 6th International Conference on Learning Representations, 2018.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Image Net: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009.

I. Deshpande, Z. Zhang, and A. Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3483 3491, 2018.

L. C. Evans. Partial Diﬀerential Equations. American Mathematical Society, 2010.

W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. In Proceedings of the 6th International Conference on Learning Representations, 2018.

C. Feﬀerman. Pointwise convergence of Fourier series. Annals of Mathematics, 98(3):551 571, 1973. ISSN 0003486X. URL http://www.jstor.org/stable/1970917.

Asokan and Seelamantula

J. Ferguson. A brief survey of the history of the calculus of variations and its applications. ar Xiv preprint, ar Xiv:math/0402357, Feb. 2004. URL arxiv.org/abs/math/0402357.

Fermi-LAT Collaboration. A gamma-ray determination of the universe s star formation history. Science, 362:1031 1034, 2018.

A. Feuerverger and R. A. Mureika. The empirical characteristic function and its applications. The Annals of Statistics, 5(1):88 97, 1977.

C. Finn, P. F. Christiano, P. Abbeel, and S. Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. ar Xiv preprint, ar Xiv:1611.03852, Nov. 2016. URL https://arxiv.org/abs/1611.03852.

R. Flamary et al. POT: Python optimal transport. Journal of Machine Learning Research, 22(78):1 8, 2021. URL http://jmlr.org/papers/v22/20-451.html.

J. B. J. Fourier. M emoire sur la propagation de la chaleur dans les corps solides. Pr esent e le 21 d ecembre 1807 a l Institut national - Nouveau Bulletin des sciences par la Soci et ephilomatique de Paris, pages 215 221, 1807.

J.-Y. Franceschi, E. de B ezenac, I. Ayed, M. Chen, S. Lamprier, and P. Gallinari. A neural tangent kernel perspective of GANs. ar Xiv preprint, ar Xiv:2106.05566, abs/2106.05566, June 2021. URL https://arxiv.org/abs/2106.05566.

J. Gao and H. Tembine. Bregman learning for generative adversarial networks. In Proceedings of the Chinese Control And Decision Conference, pages 82 89, 2018.

I. M. Gelfand and S. V. Fomin. Calculus of Variations. Prentice-Hall, 1964.

I. M. Gelfand and G. E. Shilov. Generalized Functions, Vol. 1: Properties and Operations. American Mathematical Society, 1958.

A. Genevay, G. Peyre, and M. Cuturi. Learning generative models with Sinkhorn divergences. In Proceedings of the Twenty-First International Conference on Artiﬁcial Intelligence and Statistics, volume 84, pages 1608 1617. PMLR, 2018.

C. Giardina and P. Chirlian. Bounds on the truncation error of periodic signals. IEEE Transactions on Circuit Theory, 19(2):206 207, 1972.

H. H. Goldstine. A History of the Calculus of Variations from the 17th Through the 19th Century. Springer, New York, 1980.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672 2680. 2014.

P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann, and A. Krause. An online learning approach to generative adversarial networks. In Proceedings of the 6th International Conference on Learning Representations, 2018.

Euler-Lagrange Analysis of GANs

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30, pages 5767 5777. 2017.

M. E. Gurtin. Variational principles for linear elastodynamics. Archive for Rational Mechanics and Analysis, 16(1):34 50, 1964a.

M. E. Gurtin. Variational principles for linear initial-value problems. Quarterly of Applied Mathematics, 22(3):252 256, 1964b.

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30, volume 30, 2017.

G. E. Hinton and R. Zemel. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems 6, volume 6. Morgan Kaufmann, 1994.

J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29, pages 4565 4573. 2016.

A. Jolicoeur-Martineau. The relativistic discriminator: A key element missing from standard GAN. In Proceedings of the 7th International Conference on Learning Representations, 2019.

J. L. Kelley. General Topology. Courier Dover Publications, Inc., 2017.

M. Khayatkhoei, M. K. Singh, and A. Elgammal. Disconnected manifold learning for generative adversarial networks. In Advances in Neural Information Processing Systems 32, volume 31, 2018.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.

S. Knop, P. Spurek, J. Tabor, I. Podolak, M. Mazur, and S. Jastrz ebski. Cram er-Wold autoencoder. Journal of Machine Learning Research, 21(164):1 28, 2020. URL http: //jmlr.org/papers/v21/19-560.html.

N. Kodali, J. D. Abernethy, J. Hays, and Z. Kira. On convergence and stability of GANs. ar Xiv preprint, ar Xiv:1705.07215, May 2017. URL http://arxiv.org/abs/1705.07215.

S. Kolouri, P. E. Pope, C. E. Martin, and G. K. Rohde. Sliced Wasserstein auto-encoders. In Proceedings of the 7th International Conference on Learning Representations, 2019.

A. Krizhevsky. Learning multiple layers of features from tiny images. Master s thesis, University of Toronto, 2009. URL https://ci.nii.ac.jp/naid/20001706980/en/.

J. H. Lambert. Observationes variae in mathesin puram. Acta Helveticae physicomathematico-anatomico-botanico-medica, pages 128 168, 1758.

Asokan and Seelamantula

N. S. Landkof. Foundations of Modern Potential Theory. Springer-Verlag, Berlin, New York, 1972.

Y. Le Cun, L. Bottou, Y. Bengio, and P. Haﬀner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.

N. Lei, Y. Guo, D. An, X. Qi, Z. Luo, S. T. Yau, and X. Gu. Mode collapse and regularity of optimal transportation maps. ar Xiv preprints, ar Xiv:1902.02934, Feb. 2019. URL https://arxiv.org/abs/1902.02934.

C. L. Li, W. C. Chang, Y. Cheng, Y. Yang, and B. Poczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems 30, pages 2203 2213. 2017.

J. Li, A. Madry, J. Peebles, and L. Schmidt. On the limitations of ﬁrst-order approximation in GAN dynamics. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 3005 3013, 2018.

T. Liang. How well generative adversarial networks learn distributions. Journal of Machine Learning Research, 22(228):1 41, 2021. URL jmlr.org/papers/v22/20-911.html.

H. Liu, Y. Guo, N. Lei, Z. Shu, S. T. Yau, D. Samaras, and X. Gu. Latent space optimal transport for generative models. ar Xiv preprint, ar Xiv:1809.05964, abs/1809.05964, Sep. 2018. URL https://arxiv.org/abs/1809.05964.

S. Liu, S. Bousquet, and K. Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, volume 30, 2017.

Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision, 2015.

K. Y.-C. Lui, Y. Cao, M. Gazeau, and K. S. Zhang. Implicit manifold learning on generative adversarial networks. The International Conference on Machine Learning Workshop on Implicit Models, 2017.

A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. ar Xiv preprints, ar Xiv:1511.05644, Nov. 2015. URL https://arxiv.org/abs/1511. 05644.

X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In Proceedings of International Conference on Computer Vision, 2017.

L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 3481 3490, 2018.

M. Mesterton-Gibbons. A Primer on the Calculus of Variations and Optimal Control Theory. American Mathematical Society, 2009.

Euler-Lagrange Analysis of GANs

S. Mohamed and B. Lakshminarayanan. Learning in implicit generative models. ar Xiv preprints, ar Xiv:1610.03483, Oct. 2016. URL https://arxiv.org/abs/1610.03483.

Y. Mroueh and T. Sercu. Fisher GAN. In Advances in Neural Information Processing Systems 30, pages 2513 2523. 2017.

Y. Mroueh, C. Li, T. Sercu, A. Raj, and Y. Cheng. Sobolev GAN. In Proceedings of the 6th International Conference on Learning Representations, 2018.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29, pages 271 279. 2016.

C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695 718, 2017.

F. A. Oliehoek, R. Savani, J. Gallego, E. van der Pol, and R. Groß. Beyond local Nash equilibria for adversarial networks. In Artiﬁcial Intelligence, BNAIC, volume 1021, pages 73 89, 2019.

G. Parmar, R. Zhang, and J.-Y. Zhu. On buggy resizing libraries and surprising subtleties in FID calculation. ar Xiv preprint, ar Xiv:2104.11222, abs/2104.11222, April 2021. URL https://arxiv.org/abs/2104.11222.

H. Petzka, A. Fischer, and D. Lukovnikov. On the regularization of Wasserstein GANs. In Proceedings of the 6th International Conference on Learning Representations, 2018.

J. N. M. Pinkney and D. Adler. Resolution dependent GAN interpolation for controllable image synthesis between domains. ar Xiv preprint, ar Xiv:2010.05334, Oct. 2020. URL https://arxiv.org/abs/2010.05334.

C. Qin, Y. Wu, J. Springenberg, A. Brock, J. Donahue, T. Lillicrap, and P. Kohli. Training generative adversarial networks by solving ordinary diﬀerential equations. In Advances in Neural Information Processing Systems 34. 2020.

D. Randle, P. Protopapas, and D. Sondak. Unsupervised learning of solutions to diﬀerential equations with generative adversarial networks. ar Xiv preprint, ar Xiv:2007.11133, July 2020. URL https://arxiv.org/abs/2007.11133.

M. Riesz. L int egrale de Riemann-Liouville et le probl eme de Cauchy. Acta Mathematica, 81:1 222, 1949.

K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems 30, pages 2015 2025, 2017.

Asokan and Seelamantula

K. Roth, Y. Kilcher, and T. Hofmann. Adversarial training generalizes data-dependent spectral norm regularization. ar Xiv preprint, ar Xiv:1906.01527, June 2019. URL https: //arxiv.org/abs/1906.01527.

T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, pages 2234 2242, 2016.

M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee. On the convergence and robustness of training GANs with regularized optimal transport. In Advances in Neural Information Processing Systems 31, pages 7091 7101. 2018.

J. Schmidhuber. Deep learning in neural networks: An overview. ar Xiv preprint, ar Xiv:1404.7828, abs/1404.7828, Aug. 2014. URL http://arxiv.org/abs/1404.7828.

V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein. Implicit neural representations with periodic activation functions. In Advances in Neural Information Processing Systems, volume 33, pages 7462 7473, 2020.

S. L. Sobolev. Applications of Functional Analysis in Mathematical Physics. American Mathematical Society, 1963.

E. M. Stein. Singular Integrals and Diﬀerentiability Properties of Functions. Princeton University Press, 1970.

C. Szegedy, V. Vanhoucke, S. Ioﬀe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. ar Xiv preprint, ar Xiv:1512.00567, Dec. 2015. URL https://arxiv.org/abs/1512.00567.

D. Terj ek. Adversarial Lipschitz regularization. In Proceedings of the 8th International Conference on Learning Representations, 2020.

I. O. Tolstikhin, O. Bousquet, S. Gelly, and B. Sch olkopf. Wasserstein auto-encoders. In Proceedings of the 6th International Conference on Learning Representations, 2018.

V. S. Vladimirov. Equations of Mathematical Physics. Mir Publishers, 1984.

D. Wang and Q. Liu. Learning to draw samples: with application to amortized MLE for generative adversarial learning. ar Xiv preprint, ar Xiv:1611.01722, Nov. 2016. URL https://arxiv.org/abs/1611.01722.

L. Yang et al. Highly-scalable, physics-informed GANs for learning solutions of stochastic PDEs. In Proceedings of the IEEE/ACM Third Workshop on Deep Learning on Supercomputers, pages 1 11, 2019.

J. J. Zhao, M. Mathieu, and Y. Le Cun. Energy-based generative adversarial networks. In Proceedings of the 5th International Conference on Learning Representations, 2017.

J-Y Zhu, T Park, P Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of International Conference on Computer Vision, 2017.