# generative_adversarial_networks_dynamics__34282c7b.pdf

Journal of Machine Learning Research 26 (2025) 1-30 Submitted 6/24; Revised 5/25; Published 6/25

Generative Adversarial Networks: Dynamics

Matias G. Delgadino matias.delgadino@math.utexas.edu Department of Mathematics University of Texas at Austin Austin, TX 78712, USA

Bruno B. Suassuna bruno.b.suassuna@mat.puc-rio.br Departamento de Matematica Pontif ıcia Universidade Cat olica do Rio de Janeiro Rio De Janeiro, RJ 22451-900, Brazil

Rene Cabrera rene.cabrera@math.utexas.edu Department of Mathematics University of Texas at Austin Austin, TX 78712, USA

Editor: Quanquan Gu

We study quantitatively the overparametrization limit of the original Wasserstein-GAN algorithm. Eﬀectively, we show that the algorithm is a stochastic discretization of a system of continuity equations for the parameter distributions of the generator and discriminator. We show that parameter clipping to satisfy the Lipschitz condition in the algorithm induces a discontinuous vector ﬁeld in the mean ﬁeld dynamics, which gives rise to blow-up in ﬁnite time of the mean ﬁeld dynamics. We look into a speciﬁc toy example that shows that all solutions to the mean ﬁeld equations converge in the long time limit to time periodic solutions, this helps explain the failure to converge of the algorithm.

Keywords: GAN, Aggregation Equation, blow-up

1. Introduction

Generative algorithms are at the forefront of the machine learning revolution we are currently experiencing. Some of the most famous types are diﬀusion models Sohl-Dickstein et al. (2015), generative language models Radford et al. (2018) and Generative Adversarial Networks (GAN) Goodfellow et al. (2014). GAN was one of the ﬁrst algorithms to successfully produce synthetically realistic images and audio and is the topic of this article. A guiding assumption for GAN is that the support of the distribution can be well approximated by a lower dimensional object. That is to say although P P(RK), we expect that the inherent correlations in data, like values of neighboring pixels in an image, drastically reduce the dimensionality of the problem. In broad terms, we expect that, in some non-speciﬁed sense, the eﬀective dimension of the support of P is less or equal than a latent dimension L K. The GAN algorithm tries to ﬁnd an easy way to evaluate a continuous function from G : RL RK, which we call the generator. The objective is to make G(Z) to be approximately distributed like P , where Z is distributed like the standard Gaussian N(0, 1) P(RL). To get an idea of orders of magnitude, Karras et al. (2017) creates realistic looking high resolution images of faces with K = 1024 1024 3 = 3145728 and L = 512.

c 2025 Matias G. Delgadino, Bruno B. Suassuna and Rene Cabrera.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0848.html.

As the word adversarial in its name suggests, the algorithm pits two Neural Networks against each other, the generator network G and the discriminator network D. The discriminator network tries to discern from the synthetic samples G(Z) and the real samples X P . For this purpose, the optimization over the discriminator network D is the dual formulation of a metric between the associated synthetic data distribution G#N and the real data distribution P . The original algorithm Goodfellow et al. (2014) used Jensen-Shannon divergence. The version we analyze in detail here is the Wasserstein-GAN (WGAN) Arjovsky et al. (2017) which uses the 1-Wasserstein distance instead. The behavior of GAN is known to be directly tied to the choice of the metric, see Section 3 for more details. The architecture of the Neural Networks (NN) which parametrize the generator and discriminator also plays a large role in the success of the algorithms. The paradigm for architectures at the time of the ﬁrst prototypes of GANs was to use Convolutional Neural Networks (CNNs) which exploit the natural spatial correlations of pixels, see for example Alex Net introduced in Krizhevsky et al. (2017). Currently, the paradigm has changed with the advent of attention networks which are more parallelizable and outperform CNNs in most benchmarks, see Vaswani et al. (2017). In this paper, we forego the interesting question of the role of NN architecture to understand in more detail the induced dynamics, see Section 2.1 for more details. To understand the dynamics, we will follow the success of understanding the overparametrized limit in the supervised learning problem for shallow one hidden layer NN architectures Mei et al. (2018); Chizat and Bach (2018); Rotskoﬀand Vanden-Eijnden (2022), see also Fern andez-Real and Figalli (2022); Wojtowytsch and E (2020) for reviews of these results. In a nutshell, to the ﬁrst order these articles relate Stochastic Gradient Descent (SGD) parameter training to a stochastic discretization of an associated aggregation equation Bertozzi et al. (2011); Carrillo et al. (2011), and to a second order to an aggregation diﬀusion equation Carrillo et al. (2006). In probabilistic terms, this is akin to the law of large numbers Sirignano and Spiliopoulos (2020a) and the central limit theorem Sirignano and Spiliopoulos (2020b). We note the related work of Franceschi et al. (2022) which analyzes the training dynamics of GANs in the context of the Neural Tangent Kernel (NTK) regime, introduced in Jacot et al. (2018). Under the NTK scaling, essentially the parameters do not evolve in time, which implies the training dynamics of the discriminator and generator can be approximated by a linear system. This paper, centers on the mean ﬁeld scaling discussed above, which in the limit of inﬁnite width, the parameters do evolve in time. Our contribution, which is novel even in the supervised learning case, is to quantify this type of analysis. First, we show a quantitative result for the stability of the limiting aggregation equation in the 2-Wasserstein metric, see Theorem 6. The diﬃculty of the stability in our case is not the regularity of the activation function Chizat and Bach (2018), but instead the growth of the Lipschitz constant with respect to the size of the parameters themselves. Next, we show a quantitative convergence of the empirical process to the solution to the mean ﬁeld PDE, to our knowledge this is the ﬁrst of its kind in terms of a strong metric like the 2-Wasserstein metric, see Theorem 7 and Corollary 9. We refer to Zhou et al. (2019) for a discussion on the empirical usefulness of enforcing a Lipschitz condition of the discriminator in WGAN. We note the related work of Franceschi et al. (2022) which analyzes the training dynamics of GANs in the context of the Neural Tangent Kernel (NTK) regime, introduced in Jacot et al. (2018). Under the NTK scaling, essentially the parameters do not evolve in time, which implies the training dynamics of the discriminator and generator can be approximated by a linear system. This paper, centers on the mean ﬁeld scaling, which in the limit of inﬁnite width, the parameters evolve in time. We also mention the work of Balaji et al., which studies the training dynamics of GANs in the case

GAN: Dynamics

of a single layer generator and trivially trainable linear discriminator, and show convergence to a global saddle point. Moreover, the WGAN algorithm clips the discriminator parameters after every training iteration. In the follow up work Gulrajani et al. (2017) observed numerically that it created undesirable behavior. We refer to Zhou et al. (2019) for a discussion on the empirical usefulness of enforcing a Lipschitz condition of the discriminator in WGAN. In terms of the mean ﬁeld PDE (7), the clipping of parameters induces an associated discontinuous vector ﬁeld. This explains from a mathematical viewpoint the pathology mentioned before. In a nutshell, the parameter distribution can blow-up in ﬁnite time, and after that time the discriminator network loses the universal approximation capabilities, see Secion 2.4. Failure to converge is a known problem of GAN. For instance, Karras et al. (2017) introduces a progressive approach to training higher and higher resolution pictures, eﬀectively having hot start of the algorithm at every step. By looking at an enlightening simpliﬁed example, we can explicitly understand the long time behavior of the algorithm. In this example, any initialization eventually settles to a time periodic orbit, which implies that the generator oscillates forever, see Section 3. We mention the work Kodali et al. (2018), which relates undesirable behavior of the training dynamics to the presence of local Nash equilibria. Also, Mescheder et al. (2018) studies the oscillatory behavior of the training dynamics in diﬀerent GAN algorithms.

1.1 Outline of the paper

The rest of the paper is organized as follows. Section 2 contains the notation and the main results: the well posedness of the mean ﬁeld PDE system (7) Theorem 6, and the quantiﬁed mean ﬁeld convergence Theorem 7. Section 3 contains an enlightening example of the dynamics of WGAN. Section 4 contains the proof of Theorem 6. Section 5 contains the proof of Theorem 7. Section 6 presents the conclusions and discusses some future directions for research. Appendix A recalls well posedness and approximation of diﬀerential inclusions.

2. Set up and main results

We consider a cloud of data points {xi}i I RK, which we assume to be generated by an underlying probability measure P P(RK). Although we do not have direct access to P , we assume the cloud of data is large enough so that we can readily sample xi P without any inherent bias. The task is to generate approximate samples of the distribution P from a base underlying probability measure which is easy to sample. We consider the Gaussian distribution N(0, 1) P(RL) in a latent space RL, where L is the dimension of the latent space, which is to be chosen by the user. We will try to approximate P by the push forward of said base distribution GΘ#N, where GΘ : RL RK is a parametric function, which is usually chosen to be a Neural Network. To choose the parameters Θ, whose dimensionality we will set later, we consider the following optimization problem inf Θ d1(GΘ#N, P ),

where d1 is the 1-Wasserstein distance. Although this problem seems rather straight forward, the Wasserstein distance is notorious for being diﬃcult to calculate in high dimensions, and we do not have direct access to P ; hence, in practice a proxy of said distance is chosen. More speciﬁcally, we approximate the dual problem

d1(GΘ#N, P ) = sup D Lip1

RL D(GΘ(z)) d N(z) Z

RK D(x) d P (x), (1)

by replacing the Lip1 class of functions by the parametric function DΩ: RK R,

d1(GΘ#N, P ) sup Ω

RL DΩ(GΘ(z)) d N(z) Z

RK DΩ(x) d P (x).

The parametric function DΩwill also be considered as a Neural Network and the parameters Ωare restricted to a compact convex set. The precise deﬁnition of GΘ and DΩas Neural Networks with a single hidden layer is given bellow, letting σ : R R denote the activation function. Since the parameters Ωare restricted to a compact set, if σ is C1 bounded the family {DΩ} is uniformly Lipschitz.

Remark 1 The original GAN Goodfellow et al. (2014) utilizes the Jensen-Shannon divergence, which in terms of Legendre-Fenchel dual can be written as

JS(GΘ#N, P ) = sup D Cb(RK)

RL log D(GΘ(z)) d N(z) + Z

RK log(1 D(x)) d P (x).

2.1 Neural Networks

For both the generator GΘ and discriminator DΩ, we consider the simple case of a single hidden layer, which has the universal approximation property, see Cybenko (1989). That is to say

i=1 α1 i σ(β1 i z + γ1 i ), ... , 1

i=1 αK i σ(βK i z + γK i )

where the array Θ = (θ1, , θN) ((R RL R)K)N is given by θi = (αj i, βj i , γj i )1 j K, and DΩ deﬁned by

i=1 aiσ(bi x + ci),

where the array Ω= (ω1, , ωM) (R RK R)M is given by ωi = (ai, bi, ci). To obtain rigorous quantitative estimates, throughout the paper we consider activation functions σ : R R that are in C2(R) and with bounded derivatives up to second order. The typical example being the sigmoid function

σ(u) = 1 1 + e u .

Simplifying notation, we denote

αjσ(βj z + γj) = σ(z; θj) with θj = (αj, βj, γj) R RL R, 1 j K,

aσ(b x + c) = σ(x; ω) with ω = (a, b, c) R RK R.

Remark 2 The mean ﬁeld analysis of two hidden layers NN is also possible, see for instance Sirignano and Spiliopoulos (2022).

GAN: Dynamics

2.2 Training the parameters by SGD

We follow a simpliﬁed version of parameter training algorithm which is given in the original reference Arjovsky et al. (2017), the only diﬀerence is that for comprehensibility we consider stochastic gradient descent instead of RMSProp (Tieleman (2012)), see Remark 5. We use n as the full step indexing, and l for the sub-index related to the extra training for the Discriminator s parameters. We initialize the parameters chaotically:

Ω1,1 ν M in and Θ1 µ N in ,

where Ω1,1 (R RK R)M and Θ1 ((R RL R)K)N,

and the initial distributions

νin P R RK R and µin P (R RL R)K

are ﬁxed independent of N and M. Of course, correlations in parameter initialization and N and M dependent initial conditions can be introduced if they were desirable. Iteratively in n until convergence, and iteratively for l = 2, ..., nc with nc a user deﬁned parameter, we deﬁne

Ωn,l = clip Ωn,l 1 + h Ω(DΩn,l 1(GΘn(zn l )) DΩn,l 1(xn l )) ,

Ωn+1,1 = Ωn,nc,

and Θn+1 = Θn h ΘDΩn+1,1(GΘn(zn nc+1)),

where the function clip stands for the projection onto [ 1, 1] [ 1, 1]K [ 1, 1], and h > 0 is the learning rate which is a user chosen parameter. The families {xn l }n N,l {1,...,nc}, {zn l }n N,l {1,...,nc+1} are independent RK and RL valued random variables distributed by P and N, respectively.

Remark 3 The clipping of the parameter is made to ensure that the discriminator network is uniformly bounded in the Lipschitz norm, to approximate Kantorovich s duality formulation (1). With this in mind, we should notice that the clipping of all parameter is slightly indiscriminate. For instance, the dependence of the discriminator function with respect to the parameter c is bounded by our assumption on the activation function σ, and would not need to be clipped.

Remark 4 In practice, deeper and more complex networks are utilized. The choice of analyzing 2layer Neural Nets stems from the mathematical tractability of them. Still, we claim without a proof that in the case the activation function is Re LU, the produced networks are dense in 1-Lipschitz functions with the uniform topology. We note that this is accounting for the adjustment c not being unbounded like in Remark 3. The case analyzed here of sigmoid activation function is not dense in 1-Lipschitz functions. Nevertheless, these networks metricize weak convergence of measures, with an alternative even weaker metric than the 1-Wasserstein distance. This follows by a careful re-application of Cybenko s Theorem Cybenko (1989).

Remark 5 We should note that other versions of SGD like Adam or RMSProp (see Kingma and Ba (2014) and Tieleman (2012)) are preferred by users as they are considered to outperform SGD. They introduce adaptive time stepping and momentum in an eﬀort to avoid metastability of plateaus, and falling into shallow local minima. These tweaks of SGD add another layer of complexity which we will not analyze in this paper.

2.3 Associated Measures

Associated to each family of parameters at the iteration step n we consider the empirical measures,

i=1 δΘn i P (R RL R)K

i=1 δΩn,1 i P R RK R .

Abusing notation slightly and for general probability measures µ P (R RL R)K and ν P R RK R , deﬁne

R RL R σ(z; θ1) dµ1(θ1), ... , Z

R RL R σ(z; θK) dµK(θK) (2)

and Dν(x) = Z

R RK R σ(x; ω)dν(ω), (3)

where µi, for i = 1, ..., K, denotes the i-th marginal of µ. We should note that due to the exchangeability of the parameters, there is no loss of information from considering the pair (Θn, Ωn) versus the pair (µn N, νn M). In fact, using the previous notations we have

GΘn = Gµn N and DΩn = Dνn M .

Hence, to understand the behavior of the algorithm in the overparameterization limit, we will center our attention on the evolution of the empirical measures. More speciﬁcally, we consider the curves µ C [0, ); P (R RL R)K and ν C [0, ); P R RK R to be, respectively, the linear interpolation of µn N and νn M at the time values tn = n(h/N). The choice of the scale t = h/N is arbitrary, and could also be expressed in terms of M. The relationship between N, M and nc gives rise to diﬀerent mean ﬁeld limits

and we will obtain diﬀerent behavior in terms of limiting dynamics. In this paper, we address the intermediate limit γc 1, but we should notice that in practice it is also interesting to study when γc = , which assumes that the discriminator has been trained to convergence, see Section 3 for an illustrative example. For notational simplicity, we write the proof for N = M and nc = 1, but our methods are valid for any ﬁnite value of γc 1. Explicitly, for any t [0, ), we ﬁnd n N and s [0, 1) such that

(1 s)tn + stn+1 = t

and set the intermediate value as the 2-Wasserstein geodesics:

i=1 δ(1 s)θn i +sθn+1 i and νN(t) = 1

i=1 δ(1 s)ωn,1 i +sωn+1,1 i . (5)

GAN: Dynamics

2.4 Identifying the limit

For a given pair of measures µ and ν, consider the energy functional:

E[µ, ν] = Z

RL Dν(Gµ(z)) d N(z) Z

RK Dν(x) d P (x). (6)

The evolution of the limit can be characterized by the gradient descent of E on µ and gradient ascent on ν, the latter restricted to P([ 1, 1] [ 1, 1]K [ 1, 1]). In terms of equations we consider

tµ θ µ θ δE

δµ [µ, ν] = 0,

tν + γc ω ν ProjπQ ω δE

δν [µ, ν] = 0,

µ(0) = µin, ν(0) = νin,

where we deﬁne Q = [ 1, 1] [ 1, 1]K [ 1, 1] and the ﬁrst variations are

δµ [µ, ν](θ) = Z

Q 1σ(Gµ(z); ω) (σ(z; θ1), ... , σ(z; θK)) dν(ω) d N(z),

δν [µ, ν](ω) = Z

RL σ(Gµ(z); ω) d N(z) Z

RK σ(x; ω) d P (x)

and ProjπQ : Q (R RK R) R RK R is the projection onto the tangent cone πQ(ω). In the present case the projection can be deﬁned by components as follows:

ProjπQ(ω, V )l =

( Vl, ωl ( 1, 1) Vl 1 sign(Vlωl)

2 , ωl { 1, 1} . (8)

We should notice in fact that the projection is trivial away from the boundary, or if the vector ﬁeld at the boundary points into the domain. Eﬀectively, the projection does not allow for mass to exit the domain. We do note that this can easily make mass collapse onto the boundary and ﬂatten the support of the distribution ν into less dimensions, see Section 3 for a further discussion. In the context of ODEs, the projection onto convex sets was considered by Henry (1973), which we recall and expand on Appendix A. For Hilbert spaces setting, we mention the more general sweeping processes introduced by Mureau Moreau (1977). Recently, projections of solutions to the continuity equation onto semi-convex subsets have been considered as models of pedestrian dynamics with density constraints, see for instance Di Marino et al. (2016); Santambrogio (2018); De Philippis et al. (2016).

2.5 Main Result

We start by showing that the mean ﬁeld parameter dynamics with a discontinuous vector ﬁelds are well deﬁned and stable. We quantify all the results with respect to the Wasserstein distance, with d2 and d4 representing the standard 2-Wasserstein and 4-Wasserstein distance, respectively.

Theorem 6 Given initial conditions (µin, νin) P (R RL R)K P(Q) such that for some δ > 0 Z eδ|α|2 dµin < , (9)

there exists a unique absolutely continuous weak solution to the mean ﬁeld system (7).

Moreover, we have the following stability estimate: For any T [0, ), there exists C > 1 such that sup t [0,T] d2((µ1(t), ν1(t)), (µ2(t), ν2(t))) Cd2 4((µ1,in, ν1,in), (µ2,in, ν2,in)), (10)

for any pair of weak solutions (µ1, ν1) and (µ2, ν2).

The proof of Theorem 6 is given in Section 4, see Proposition 11 for a precise dependence of the constants. Our main result is the following estimate on the continuous time approximation of parameter dynamics.

Theorem 7 Let (µN(t), νN(t)) be the empirical measures associated to the continuous time interpolation of the parameter values, assumed to be initialized by independent samples from (µin, νin) given by (5). Consider (ˆµN(t), ˆνN(t)) the unique solution to the PDE (7) with random initial conditions (µN(0), νN(0)). If µin has bounded double exponential moments on α, that is to say for some δ > 0

eeδ|α|2 < , (11)

then for any ﬁxed time horizon T [0, ) there exists C > 0 such that

sup t [0,T] Ed2 2((µN(t), νN(t)), (ˆµN(t), ˆνN(t))) C

Remark 8 The need for (11) stems from the linear dependence of the Lipschitz constant of the mean ﬁeld vector ﬁeld with respect to the size of the parameters, see Lemma 14.

The proof of Theorem 7 is presented in Section 5. Using the convergence Theorem 7 and the stability of the mean ﬁeld Theorem 6, we can obtain a convergence rate estimate which suﬀers the curse of dimensionality.

Corollary 9 Under the hypotheses of Theorem 6 and Theorem 7, for any ﬁxed T > 0, there exists C > 0 such that

max t [0,T] Ed2 2((µ(t), ν(t)), (µN(t), νN(t))) C

2 K(L+2) (13)

where (µ, ν) is the unique solution of (7) and (µN, νN) is the curve of interpolated empirical measures associated to the parameter training (5).

Remark 10 We should note that the diﬀerence between the results of Theorem 7 and Corollary 9 is that the dependence in the number of neurons N in estimate (12) does not suﬀer from the curse of dimensionality, while the stronger estimate (13) does. It is worth mentioning that, there is a mild dependence of dimension in (12) through the q-th moment Mq of the initial conditions µin νin. The later dependence on dimension is typical and sharp for the approximation of the Wasserstein distance with sampled empirical measures, see Dudley (1978); Fournier and Guillin (2015); Bolley et al. (2007). This stiﬀdependence on dimension suggests that studying the long time behavior of the mean ﬁeld dynamics of smooth initial data (µ(t), ν(t)) is not necessarily applicable in practice. Instead, the focus should be to show that with high probability that discrete mean ﬁeld trajectories (ˆµN(t), ˆνN(t)) converge to a desirable saddle point of the dynamics. See Section 3 for an explicit example of long time behavior.

GAN: Dynamics

Proof [Proof of Corollary 9] We consider the auxiliary pair of random measure-valued paths (ˆµN, ˆνN) which are a solution to (7) with stochastic initial conditions (µN(0), νN(0)), that is

ˆµN(0) = µN(0) = 1

i=1 δθi,in and ˆνN(0) = νN(0) = 1

i=1 δωi,in,

where θi,in and ωi,in are N independent samples from µin and νin, respectively. By the large deviation estimate in Fournier and Guillin (2015), for q large enough we have

E[d4 4((ˆµN(0), ˆνN(0)), (µin, νin))] CM

4 K(L+2) + 1

where Mq denotes the q-th moment of µin νin. By Theorem 6, taking q large enough, and using that µin νin has ﬁnite moments of all orders we have

E[d2 2((ˆµN(t), ˆνN(t)), (µ(t), ν(t))] CE[d2 4((ˆµN(0), ˆνN(0)), (µ(0), ν(0)))] C

By the triangle inequality,

d2((µ(t), ν(t)), (µN(t), νN(t))) d2((µN(t), νN(t)), (ˆµN(t), ˆνN(t)) + d2((ˆµN(t), ˆνN(t)), (µ(t), ν(t)),

so the result follows by the previous estimate and Theorem 7.

3. Mode Collapse and Oscillatory Behavior

A standard problem of GANs is known as mode collapse, Srivastava et al. (2017); Metz et al. (2016); Thanh-Tung and Tran (2020). This can be broadly described as the generator outputting only a small subset of the types of clusters that are present in the original distribution. Although the generator outputs a convincing sample if considered individually, the overall distribution of samples is oﬀ. An extreme example is when the generator outputs almost identical samples for any value of the latent variable z. An explanation of this behavior for the original GAN algorithm is the use of Jensen-Shannon divergence JS(G#N, P ), see Remark 1. More speciﬁcally, if the measures are mutually singular G#N P , then JS(G#N, P ) = log(2) independently of how close the supports are to each other. Namely, the gradient of the associated loss function vanishes and there are no local incentives for the generator to keep learning. As we are not expecting for the support of these measures to be absolutely continuous, in fact we are postulating that in some sense the dimension of the support of G#N is smaller than L K, this case is more likely to be normal than the exception. The W-GAN Arjovsky et al. (2017) and its improved variant Gulrajani et al. (2017) try to ﬁx this by considering 1-Wasserstein distance instead which does not suﬀer from the vanishing gradient problem. Still, training the Generator to get useful outputs is not an easy task, it requires a lot of computation time and more often than not it fails to converge. For instance to produce realistic looking images Karras et al. (2017) took 32 days of GPU compute time, and the networks are trained on progressively higher and higher resolution images to help with convergence.

3.1 An explicit example

Consider a bimodal distribution as the toy example:

We consider the simplest network that can approximate this measures perfectly. We consider the generator, depending on a single parameter g R to be given by

( 1 x < g 1 x > g.

Although, this generator architecture seems far from our assumptions 2.1. This type of discontinuity arises naturally as a limit when the parameters go to inﬁnity. Namely, if we take b, c in such a way that c/b g R, then

( 0 x < g 1 x > g,

where σ is the sigmoid. The generator G can then be recovered as a linear combination of two such limits. The generated distribution is given by

Gg#P = Φ(g)δ 1 + (1 Φ(g))δ1,

where Φ(g) = P({z < g}) is the cumulative distribution function of the prior distribution P P(R), which we can choose. We make the choice of the cumulative distribution

Φ(g) = 1 1 + e g for g R

to simplify the calculations. Under this choice for g = 0, we have that Gg#P = P , hence the network can approximate the target measure perfectly. Moreover, we can explicitly compute the 1-Wasserstein distance,

d1(Gg#P, P ) = 1 2 Φ(g) ,

see the ﬁgure below.

6 4 2 0 2 4 6

d1(Gg#P, P )

GAN: Dynamics

We can clearly see that this function has a unique minimum at g = 0, and also that this function is concave in g away from g = g . The concavity of the functional makes the problem more challenging from the theoretical perspective and it will explain the oscillatory behavior of the algorithm close to the minimizer g . For the discriminator, we consider a Re LU activation given by

D(x; ω) = (ωx)+

with ω [ 1, 1]. We note that taking a single parameter, instead of a distribution, for the discriminator is supported by the mean ﬁeld dynamics (7). In the sense that under a bad initialization of parameters, the parameters of the discriminator can blow up in ﬁnite time to ν = δω(t). We consider the joint dependence function

Ψ(ω, g) = Z

R Dω(Gg(z)) d P(z) Z

R Dω(x) d P (x)

= Φ(g)( ω)+ + (1 Φ(g))(ω)+ 1

2 2 1 0 1 2

Ignoring, for now, the projection onto ω [ 1, 1], we have the dynamics

g(t) = gΨ[g, ω] = 1 2 e g

(1 + e g)2 ω

ω(t) = γc ωΨ[g, ω] = γc

2 e g 1 1 + e g ,

where γc is the critics speed up (4). These dynamics can be integrated perfectly, to obtain that

Eγc(ω(t), g(t)) = 2 cosh(g(t)) + |ω(t)|2

γc = 2 cosh(gin) + |ωin|2

γc = Eγc(ωin, gin).

3 2 1 0 1 2 3

Contours of γc = 1

3 2 1 0 1 2 3

Contours of γc = 10

Eγc = 2.1 Eγc = 2.5 Eγc = 3 Eγc = 4 Eγc = 5 Eγc = 10 |ω| = 1

In the ﬁgure above, we plot the level sets of Eγc as well as the restriction of |ω| 1. We notice that, given the value of γc there exists a unique level set

E (γc) = 2 + 1

such that the level set {Eγc = E } is tangent to the restriction |ω| 1. Now, we consider the dynamics with the restriction |ω| 1. We notice that for any initial conditions (ωin, gin) satisfying Eγc(ωin, gin) E (γc) the trajectory of parameters is unaﬀected by the restriction |ω| 1 and it is time periodic. On the other hand, if we consider initial conditions (ωin, gin) satisfying Eγc(ωin, gin) > E (γc) and |ωin| 1, the trajectory will follow the unconstrained dynamics until it hits the boundary of the restriction ω(t) Q = {|ω| = 1}. Then it follows on the boundary ω(t) Q = {|ω| = 1} until it reaches the point (ω(t ), g(t )) = ( 1, 0) on the tangential level set {Eγc = E } and start following this trajectory becoming time periodic. Hence, there exists t = t(Eγc(ωin, gin)) large enough, such that the trajectory (ω(t), g(t)) {Eγc(ωin, gin) = E } for t > t . Therefore, we can conclude that

|g(t)| cosh 1 1 + 1

Looking back at the ﬁgure, we can see that for γc = 1 that the limiting trajectory is {E1 = 3}, and that the generator parameter oscillates in the range |g(t)| 0.96 for t > t . While for γc = 10, we obtain that the limiting trajectory is {E10 = 2.1} and the limiting oscillations are smaller |g(t)| 0.31 for t > t . We do notice that regardless of the parameter γc and the initial conﬁguration, the limiting trajectory is always periodic in time. In fact, we expect that every trajectory of the mean ﬁeld dynamics settles into a periodic solution.

4. Properties of the mean ﬁeld

One of the main theoretical obstructions to understand the well-posedness of this ﬂow is that the projection operator ProjπQ induces a discontinuous vector ﬁeld in (7). Nevertheless, the convexity of the domain Q = [ 1, 1] [ 1, 1]K [ 1, 1] can be leveraged to obtain a stability estimate.

GAN: Dynamics

Given a time dependent continuous vector ﬁeld V : [0, ) Q R RK R, its projection ProjπQVt is a Borel measurable vector ﬁeld which is square integrable in space and time for any ﬁnite time horizon T > 0 and curve of probability measures ν C([0, ), P(Q)), Z T

Q |ProjπQVt|2 dνt

Hence, as long as the underlying velocity ﬁeld inducing the motion is continuous, we can consider the notion of weak solution for the continuity equation given by (Ambrosio et al., 2005, Chapter 8). With this in mind, we ﬁrst notice the Lipschitz continuity properties of the vector ﬁelds that induce the motion (7). More speciﬁcally, we denote by

V Θ (µ,ν)(θ) = θ δE

δµ [µ, ν](θ) = EzvΘ (µ,ν)(θ, z) (14)

and V Ω (µ,ν)(ω) = ω δE

δν [µ, ν](ω) = Ez ExvΩ (µ,ν)(ω, z, x), (15)

where we deﬁne the vector ﬁelds

vΘ (µ,ν)(θ, z) = θ

[ 1,1]1+K+1 1σ(Gµ(z); ω) (σ(z; θ1), ... , σ(z; θK)) dν(ω) (16)

and vΩ (µ,ν)(ω, z, x) = ω[σ(Gµ(z); ω) σ(x; ω)]. (17)

In Lemma 14 below, we show that V Θ (µ,ν)(θ) and V Ω µ,ν(ω) are Lipschitz continuous with respect to the dependence of arguments θ, ω as well as the measure arguments (µ, ν). Notice that V Ωand vΩ

do not depend on ν, only on µ. By (Ambrosio et al., 2005, Theorem 8.2.1), any continuous solution to the continuity equation (7) is supported over solutions of the associated characteristic ﬁeld. Using the classical theory Henry (1973) for projected ODE ﬂows, we can show that the characteristic equations ( d

dt(θ, ω) = (V Θ (µ,ν)(θi), ProjπQ(ωi)V Ω (µ,ν)(ωi))

(θ, ω)(0) = (θin, ωin) (18)

have a unique solution. More speciﬁcally, an absolutely continuous curve (µ, ν) AC([0, ); P (RL+2)K P(Q)) is a weak solution to (7), if it is given as the image of the initial distributions (µin, νin) through the unique projected ODE ﬂow. That is to say,

(µ, ν)(t) = Φt (µ,ν)#(µin, νin) (19)

the family of continuous mappings Φt µ,ν : (R RL R)K Q (R RL R)K Q given by

Φt (µ,ν)(ωin, θin) = (ω(t), θ(t)), (20)

where (ω(t), θ(t)) is the unique Lipschitz solution to (18). One of the main technical hurdles is that the vector ﬁelds inducing the motion are only locally Lipschitz. The Lipschitz constant depends itself on the size of α, which is the ﬁrst variable of θ. Hence, to obtain a stability estimates we need to measure the distance of the initial condition in a p-Wasserstein distance with p > 2. The choice of p = 4 in the following result is arbitrary.

Proposition 11 (Stability) Assume (µ1, ν1), (µ2, ν2) AC [0, ); P (RL+2)K P(Q) are weak solutions to (7) which satisﬁes (19). Assume that the initial distribution has bounded exponential moments in the following sense: there exists δ > 0 such that Z eδ|α|2 dµ1,in, Z eδ|α|2 dµ2,in < .

Then for any t > 0 we have the bound

d2((µ1(t), ν1(t)), (µ2(t), ν2(t))) A(t)e B(t)d2 4((µ1,in, ν1,in), (µ2,in, ν2,in)), (21)

A(t) = e C(t2+tΛ) Z e Ct|α|dµ1,in + Z e Ct|α|dµ2,in

and B(t) = Ct A(t) (t + Λ) ,

Λ = 1 + Z |α|2dµ1,in

1/2 + Z |α|2dµ2,in

and C > 0 a constant that only depends on σ C2.

Remark 12 The double exponential growth on the estimate is related to the dependence Lipschitz constant of the vector ﬁeld with respect to the size of the parameters themselves, see Lemma 14 for the speciﬁc estimates.

For discrete initial conditions, existence to (7) follows from applying the results in Appendix A. Using stability, we can then approximate the initial condition by taking discrete approximations of it.

Proposition 13 (Existence) For any initial condition (µin, νin) P (RL+2)K P(Q) satisfying that there exists δ > 0 such that Z eδ|α|2 dµin < ,

there exists (µ, ν) AC [0, ); P (RL+2)K P(Q) a weak solution to (7) which satisﬁes the mild formulation (19).

Proof [Proof of Proposition 13] For any L N we consider a deterministic discretization

i=1 wiδθi L and νL in =

i=1 viδωi L

of the initial conditions µin, νin, where wi and vi are weights which add up to 1. The main properties we need from this discretization is that

lim L d4((µL in, νL in), (µin, νin)) = 0 and Z eδ|α|2 dµL in Z eδ|α|2 dµin.

Such a discretization can be given by the following procedure. For simplicity we consider R = 2k(L+2)K, we divide the box [ log R, log R](L+2)K into equal sized boxes {Bi}L i=1. We assign θi to be the the point with the smallest norm of the box Bi, and the weights are given by wi = µin(Bi).

GAN: Dynamics

We add any leftover mass on ([ log R, log R](L+2)K)c to the delta at the origin. We do the same to produce νL in. By Appendix A, for any L N there exists a unique solution to the projected ODE associated to the solution of the mean ﬁeld equations with initial conditions given by (µL in, νL in). Hence, we can construct a global weak solution to the PDE (µL(t), νL(t)). By the stability result, we know that for any ﬁnite time horizon T > 0, {(µL, νL)}L form a Cauchy sequence in AC([0, T], P (RL+2)K P(Q)). Hence, there exists (µ, ν) AC [0, ); P (RL+2)K P(Q) , such that for any ﬁxed time horizon T lim L sup t [0,T] d2 2((µ(t), ν(t)), (µL(t), νL(t)) = 0.

By Lemma 15, µL satisﬁes the growth condition (26), and so does µ. By Lemma 16, we have that the associated projected ODE ﬂows also converge

lim L sup t [0,T] |Φt (µL,νL)(θ, ω) Φt (µ,ν)( θ, ω)|2 e C(Λ+|α|+| α|)|(θ, ω) ( θ, ω)|2. (22)

Using that (µL(t), νL(t)) = Φt (µL,νL)#(µL inνL in)

and the uniform exponential integrability of (µL in, νL in), we can conclude that

(µ(t), ν(t)) = Φt (µ,ν)#(µin, νin),

which in turn implies that (µ, ν) is a weak solution to (7) satisfying (19).

For the next lemma we use the notation θ = (θ1, . . . , θK) with θi = (αi, βi, γi) R RL R, and α = (α1, . . . , αK) RK.

Lemma 14 There exists C R depending on σ C1 such that the vector ﬁelds (14), (15), (16) and (17) satisfy the bounds

V Ω (µ,ν) C 1 + Z |α|dµ , vΩ (µ,ν)( , z, x) C 1 + |x| + Z |α|dµ , (23)

( C for r = 1 C|αj| for r = 1, vΘ j

( C for r = 1 C|αj| (1 + |z|) for r = 1, (24)

where (vj)r denotes the r-th component of the j-th position. Moreover, we have the following Lipschitz estimate. There exists C R depending on σ C2, such that V Θ (µ1,ν1)(θ) V Θ (µ2,ν2)( θ) C(|α| + | α| + A(µ1, µ2)) d2((µ1, ν1), (µ2, ν2)) + |θ θ|

|V Ω (µ1,ν1)(ω1) V Ω (µ2,ν2)(ω2)| CA(µ1, µ2) |ω1 ω2| + d2(µ1, µ2) ,

and |vΘ (µ1,ν1)(θ, z) vΘ (µ2,ν2)( θ, z)| C (|α| + | α| + A(µ1, µ2) + |z|) d2((µ1, ν1), (µ2, ν2)) + |θ θ| ,

|vΩ (µ1,ν1)(ω1, z, x) vΩ (µ2,ν2)(ω2, z, x)| C (A(µ1, µ2) + |x| + |z|) |ω1 ω2| + d2(µ1, µ2) ,

A(µ1, µ2) = 1 + Z |α|2dµ1

1/2 + Z |α|2dµ2

and α1, α2 are the ﬁrst components of θ1, θ2, respectively.

Proof [Proof of Lemma 14] Throughout the proof, we use the notation θ = (θ1, ..., θK) with θi = (αi, βi, γi) R RL R, α = (α1, ..., αK) (RL)K, and ω = (a, b, c) Q. We begin by explicitly writing out the vector ﬁelds

vΩ (µ,ν)(ω, z, x) =

σ(b Gµ(z) + c) σ(b x + c) a Gµ(z)σ (b Gµ(z) + c) axσ (b x + c) aσ (b Gµ(z) + c) aσ (b x + c)

and vΘ (µ,ν)(θ, z) = vΘ (µ,ν);1(θ1, z), , vΘ (µ,ν);K(θK, z) with for 1 j K:

vΘ (µ,ν);j(θj, z) = Z

[ 1,1]1+K+1

abjσ(βj z + γj)σ(b Gµ(z) + c) abjαjzσ (βj z + γj)σ(b Gµ(z) + c) abjαjσ (βj z + γj)σ(b Gµ(z) + c)

Bounding the generator (2), we have

|Gµ(z)| Z |σ(z; θ)|dµ(θ) C Z |α|dµ(θ) . (25)

Using (25), and that |a|, |b|, |c| 1, we readily obtain (23) and (24). Applying the mean value theorem,

ωσ(x1; ω1) ωσ(x2; ω2)

σ (ξ0)[b1 x1 b2 x2 + c1 c2] a1x1 σ (ξ1)[(b1 x + c1) (b2 y + c2)] + (x1(a1 a2) + (x1 x2)a2)σ (b2 y + c2) a1σ (ξ1)[(b1 x1 + c1) (b2 x2 + c2)] + (a1 a2)σ (b2 x2 + c2)

where ξ0, ξ1 are points in between b1 x+c1 and b2 y+c2. To obtain the estimate for vΩ, we consider the diﬀerence above in two instances x1 = x2 = x, and taking x1 = Gµ1(z) and x2 = Gµ2(z). Using the triangle inequality, and σ C2 < , we can conclude

|vΩ (µ1,ν1)(ω1, z, x) vΩ (µ2,ν2)(ω2, z, x)|

C (1 + |x| + Gµ1(z) + Gµ2(z)) (|ω1 ω2| + |Gµ1(z) Gµ2(z)|) .

To estimate Gµ1(z) Gµ2(z), we consider π a coupling between µ1 and µ2, and notice that the diﬀerence is given by

|Gµ1(z) Gµ2(z)| =

Z σ(z; θ) σ(z; θ)dπ(θ, θ) Z |σ(z; θ) σ(z; θ)| dπ.

GAN: Dynamics

Estimating, |σ(z, θ) σ(z, θ)| C 1 + (|α| + | α|)(1 + |z|) |θ θ|.

Applying the Cauchy-Schwarz inequality,

|Gµ1(z) Gµ2(z)|2 C 1 + Z |α|2 dµ1 + Z | α|2 dµ2

(1 + |z|2) Z |θ θ|2 dπ.

Taking π to be the optimal coupling with respect to the d2 distance, and using (25), we conclude

|vΩ (µ1,ν1)(ω1, z) vΩ (µ2,ν2)(ω2, z)|2

C 1 + |x|2 + |z|2 + P i=1,2 R |α|2 dµi |ω1 ω2|2 + d2 2(µ1, µ2) .

For vΘ, apply the same argument as above to obtain a bound that also depends on the size of |α|.

Lemma 15 Let (µ, ν) AC([0, T]; P((RL+2)K) P(Q)) a weak solution to (7), then Z |α|2dµt C Z |α|2dµin + t2 . (26)

Proof By the bound (V Θ j )1 C, we conclude that |α(t, θin)| |αin| + Ct, which implies the desired bound.

A key step in the proof of existence and uniqueness, Proposition 13 and Proposition 11, is the stability of the projected ODE ﬂow.

Lemma 16 We consider (µ1, ν1), (µ2, ν2) AC([0, T]; P((RL+2)K) P(Q)) that satisfy the growth condition (26). The associated ﬂow maps (20) satisfy the bounds

|Φt (µ1,ν1)(θ1, ω1) Φt (µ2,ν2)(θ2, ω2)|2 e C(Λ+|α1|+|α2|)te Ct2|(θ1, ω1) (θ2, ω2)|2

+Ce C(Λ+|α1|+|α2|)te Ct2 Z t

0 C(r)d2 2((µ1, ν1) (r), (µ2, ν2) (r))dr,

Λ = 1 + Z |α1|2dµ1,in

1/2 + Z |α2|2dµ2,in

C(r) = e C|α!|re C|α2|re CΛre Cr2/2(Λ + r + |α1| + |α2|)

for some constant C > 0 depending on T.

Proof Recall the Lipschitz bounds on Lemma 14 are given by

CΘ(θ1, θ2) = C

1 + |α1| + |α2| + Z |α|2dµ1

1/2 + Z |α|2dµ2

1 + Z |α|2dµ1

1/2 + Z |α|2dµ2

By the bound (V Θ j )1 C, we conclude that

|α(t, θin)| |αin| + Ct.

Combining this with the growth assumption (26), we have

CΘ(θ1(t), θ2(t)) C (CΘ(θ1,in, θ2,in) + t) and CΩ(t) C (CΩ(µ1,in, µ2,in) + t) .

Taking the derivative of the distance, we ﬁnd

1 2 d dt|θ1(t) θ2(t)|2 = D θ1(t) θ2(t), V Θ (µt,1,νt,1)(θ1(t)) V Θ (µt,2,νt,2)(θ2(t)) E

CΘ(θ1(t), θ2(t)) |θ1(t) θ2(t)|2 + d2(µ1,t, µ2,t)2 + d2(ν1,t, ν2,t)2 , 1 2 d dt|ω1(t) ω2(t)|2

= D ω1(t) ω2(t), Projπ(Q)V Ω (µt,1,νt,1)(ω1(t)) Projπ(Q)V Ω (µt,2,νt,2)(ω2(t)) E

D ω1(t) ω2(t), V Ω (µt,1,νt,1)(ω1(t)) V Ω (µt,2,νt,2)(ω2(t)) E

CΩ(t) |ω1(t) ω2(t)|2 + d2(µ1,t, µ2,t)2 ,

where we have used the non-expansiveness property of the projection. Let ΛΩ= 1+ R |α|2dµ1,in 1/2+ R |α|2dµ2,in 1/2 and ΛΘ = ΛΩ+|α1(0)|+|α2(0)|. The estimates above can then be written as

d dt|θ1(t) θ2(t)|2 C(ΛΘ + t) |θ1(t) θ2(t)|2 + d2(µ1,t, µ2,t)2 + d2(ν1,t, ν2,t)2 ,

d dt|ω1(t) ω2(t)|2 C(ΛΩ+ t) |ω1(t) ω2(t)|2 + d2(µ1,t, µ2,t)2 ,

which by Gronwall s inequality implies that

|θ1(t) θ2(t)|2 e C(ΛΘt+t2)|θ1,in θ2,in|2

0 e C(ΛΘt+t2 ΛΘr r2)(ΛΘ + r) d2 2((µ1, ν1), (µ2, ν2) dr.

|ω1(t) ω2(t)|2 e C(ΛΩt+t2/2)|ω1,in ω2,in|2

0 e C(ΛΩt+t2/2 ΛΩr r2)(ΛΩ+ r)d2 2(µ1, µ2)dr.

Putting both inequalities together, we arrive at the desired result.

We now use this ODE estimate to prove Proposition 11. Proof [Proof of Proposition 11] Let

d(t) = d2 2((µ1, ν1) (t), (µ2, ν2) (t)),

and notice that for any coupling Π between µ1,in ν1,in and µ2,in ν2,in

d(t) Z |(θ1(t), ω1(t)) (θ2(t), ω2(t))|2dΠ ((θ1,in, ω1,in), (θ2,in, ω2,in)),

GAN: Dynamics

since the push-forward of Π along the ODE ﬂow at time t is a coupling between µ1,t ν1,t and µ2,t ν2,t. Using Lemma 16, we obtain that

d(t) Z e C(Λ+|α1|+|α2|)te Ct2|(θ1, ω1)(0) (θ2, ω2)(0)|2dΠ | {z } I

0 d(r) Z e C(Λ+|α1|+|α2|)(t r)e C(t2 r2)(Λ + r + |α1| + |α2|)dΠ

For I we apply the Cauchy-Schwarz and Cauchy s inequality, and take Π as the optimal coupling with respect to the 4-Wasserstein to get the bound,

I e C(Λt+t2) Z e C|α|tdµ1,in + Z e C|α|tdµ2,in

1/2 Z |(θ1, ω1)(0) (θ2, ω2)(0)|4dΠ

e C(Λt+t2) Z e C|α|tdµ1,in + Z e C|α|tdµ2,in

1/2 d2 4((µin,1, νin,1) , (µin,2, νin,2)).

We bound II from above uniformly in r by the Cauchy-Schwarz inequality,

II C Z e C|α|tdµ1,in + Z e C|α|tdµ2,in

1/2 (ΛΩ+ t) .

Therefore, we ﬁnd that for every t > 0

d(t) A(t)d2 4((µin,1, νin,1) , (µin,2, νin,2)) + B(t) Z t

where we deﬁne B(t) = CA(t)(Λ + t) and

A(t) = e C(Λt+t2) Z e C|α|tdµ1,in + Z e C|α|tdµ2,in

Gronwall s inequality implies that for all t 0

d2 2((µt,1, νt,1) , (µt,2, νt,2)) A(t)et B(t)d2 4((µin,1, νin,1) , (µin,2, νin,2)),

using the monotonicity of A(t) and B(t).

5. Continuous time approximation of parameter dynamics

Proof [Proof of Theorem 7] We consider the parameter training algorithm with learning rate h > 0 and a single hidden layer of N neurons for both the generator and the discriminator neural networks. We denote the parameter values at step n by (θn i , ωn i )i=1,...,N and the parameter dynamics is ( θn+1 i = θn i + h

N vΘ µn N,νn N (θn i , zn)

ωn+1 i = Proj Q(ωn i + h

N vΩ µn N,νn N (ωn i , zn, xn)),

where at each step we sample xn P and zn N independently, µn N denotes the empirical measure associated to θn 1 , . . . , θn N and νn N the empirical measure associated to ωn 1 , . . . , ωn N. The

parameters are assumed to be initialized by independently (θ0 i , ω0 i ) by sampling µin νin. The linear interpolation of the parameters to a continuous time variable t > 0 with time step t = h/N will be denoted by (θi, ωi), where we let θi(tn) = θn i and ωi(tn) = ωn i , with tn = n t = nh/N. We let µ and ν be the empirical measures associated to θ1, . . . , θN and ω1, . . . , ωN. We suppress the dependence on N of the measures for notational simplicity. We consider the mean ﬁeld ODE system deﬁned by the expectation of the vector ﬁelds over z and x ( d

dt ˆθi = V Θ (ˆµ,ˆν)(ˆθi)

d dt ˆωi = ProjπQ(ωi)V Ω (ˆµ,ˆν)( ˆωi),

where ˆµ and ˆν are the empirical measures associated to ˆθ1, . . . , ˆθN and ˆω1, . . . , ˆωN, respectively, and the initial conditions are coupled to the parameter training by ˆθi(0) = θ0 i and ˆωi(0) = ω0 i . More clearly, the probability measures ˆµ and ˆν are the solutions of the PDE (7) with random initial conditions chosen as (ˆµ(0), ˆν(0)) = (µN(0), νN(0)). To simplify the arguments, we ﬁrst consider the distance between mean ﬁeld ODE system and the discrete projected forward Euler algorithm

ˆθn+1 i = ˆθn i + t V Θ (ˆµn,ˆνn)(ˆθi)

ˆωn+1 i = Proj Q ˆωn i + t V Ω (ˆµn,ˆνn)(ˆωn i ) ,

where we let T > 0 be a ﬁxed time horizon and consider t = h/N, where h > 0 is the user deﬁned learning rate. To estimate the diﬀerence between the continuum and the discrete approximation, we can use a similar argument to Theorem 18, taking into consideration the bound on the Lipschitz constant of the vector ﬁelds given by Lemma 14. We can obtain the bound

i=1 |ˆθ t i ˆθ|2 #

t C 1 + Eµin h e Ce C|α|i .

The argument is simpler than the argument below, so we skip it to avoid burdensome repetition. We deﬁne

en i = |ˆθn i θn i |2 + |ˆωn i ωn i |2 and en = 1

and notice the inequality d2 2((µn, νn), (ˆµn, ˆνn)) en.

Using a step in either algorithm

en+1 i = |ˆθn i + t V Θ (ˆµn,ˆνn)(ˆθn i ) (θn i + tvΘ (µn,νn)(θn i ))|2

+|Proj Q(ˆωn i + t V Ω (ˆµn,ˆνn)(ˆωn i )) Proj Q(ωn i + tvΩ (µn,νn)(ωn i ))|2.

Using that the projection is contractive, expanding the square and bounding we obtain

en+1 i en i + t(An i + Bn i ) + ( t)2Cn i ,

An i = 2 ˆθn i θn i , V Θ (ˆµn,ˆνn)(ˆθn i ) V Θ (µn,νn)(θn i ) + ˆωn i ωn i , V Ω (ˆµn,ˆνn)(ˆωn i ) V Ω (µn,νn)(ωn i ) ,

Bn i = 2 ˆθn i θn i , V Θ (ˆµn,ˆνn)(θn i ) vΘ (µn,νn)(θn i , zn)

+2 ˆωn i ωn i , V Ω (ˆµn,ˆνn)(ωn i ) vΩ (µn,νn)(ωn i , zn, xn) ,

GAN: Dynamics

and Cn i = 2 |V Θ (ˆµn,ˆνn)(ˆθn i )|2 + |vΘ (µn,νn)(θn i )|2 + |V Ω (ˆµn,ˆνn)(ˆωn i )|2 + |vΩ (µn,νn)(ωn i )|2 .

Using the bounds Lemma 14, we get the growth bound

|αn i |, |ˆαn i | |αi,in| + Cn t,

and the estimates for n t < T

An i Ki(en i + en) and Cn i (1 + |xn|2 + |zn|2)K2 i ,

Using e0 = 0 and a telescopic sum, we get

en+1 i t Ki

r=0 (er i + er) + t

r=0 Br i + t K2 i

r=0 (1 + |xr|2 + |zr|2).

Next, we will take the conditional expectation with respect to the variables {αj,in}. To this end, we notice the bound

r=0 E[|Br i |2||{αj,in}] + 2

r2=r1+1 E[Br1 i Br2 i |{αj,in}]

r=0 E[er i + er|{αj,in}]

r=0 E[er i + er|{αj,in}]

where we have used that by Lemma 14

|Br i |2 K2 i (er i + er)

and that E[Br1 i Br2 i |{αj,in}] = 0,

which follows by using the law of iterated expectation with the sigma algebra Fr2 generated by {(θi in, ωi in}N i=1, {xr}r2 1 r=0 and {zr}r2 1 r=0 . Namely,

E[Br1 i Br2 i |{αj,in}] = E[Br1 i E[Br2 i |Fr2]|{αj,in}],

where we have used that each Br1 i is a measure with respect to Fr2 as r1 < r2. Finally, using that zr2 and xr2 are independent with respect Fr2, and that (ˆθr2 1 i , ˆωr2 1 i ) and (θr2 1 i , ωr2 1 i ) are measurable with respect Fr2, we have

E[Br2 i |Fr2] = 2Ezr2[ ˆθr2 1 i θr2 1 i , V Θ (ˆµr2 1,ˆνr2 1)(θr2 1 i ) vΘ (µr2 1,νr2 1)(θr2 1 i , zr2) ]

+2Ezr2,xr2[ ˆωr2 1 i ωr2 1 i , V Ω (ˆµr2 1,ˆνr2 1)(ωn i ) vΩ (µr2 1,νr2 1)(ωr2 1 i , zr2, xr2) ]

Using the previous bound, that the distributions for xr and zr have ﬁnite second moments, and that Ki is a deterministic function of {αj,in}, we obtain up to a change of constant

E[en+1 i |{αj,in}] t Ki

r=0 E[er i + er|{αj,in}] + K2 i t.

Applying a discrete version of Gromwal s inequality, we have

E[en+1 i |{αj,in}] t Kie TKi n X

r=0 E[er|{αj,in}] + t K2 i e TKi.

Summing over i, we obtain

E[en+1|{αj,in}] t K

r=0 E[er|{αj,in}] + t 1

i=1 K2 i e TKi,

i=1 Kie TKi.

Using discrete Gromwall s inequality one last time we have the estimate

E[en+1|{αj,in}] te TK 1

i=1 K2 i e TKi. (27)

Taking expectation, we can bound

E[en+1] t E

i=1 K4 i + 1

i=1 e2TKi #

Hence, up to changing constants we have the bound

E[en+1] t C 1 + E h e C 1

N PN i=1 e C|αi|i

= t C 1 + Eµin h e C N e C|α|i N

t C 1 + Eµin h e Ce C|α|i C t = C h

The desired bound (12) follows from using the bound (11) to show that the right hand side above is ﬁnite.

6. Conclusions and future directions

We showed rigorously and quantitatively that the Wasserstein-GAN algorithm is a stochatic discretization of the well-posed PDE system given by (7). Here, we use the insight gained from the dynamics to explain some of the pitfalls of W-GAN Arjovsky et al. (2017) that help explain why is the algorithm ﬁnicky to converge. We center in two salient points: the discontinuity of the vector

GAN: Dynamics

ﬁeld for the parameters of the discriminator network and the long time behavior of the mean ﬁeld dynamics. We noticed that the clipping of the parameters induces that the dynamics are given by a discontinuous vector ﬁeld, which forces the dynamics into a box Q. In essence, the parameters of the discriminator move within the box Q without anticipating its boundary and crash into Q. This is akin to birds ﬂying into a window. This produces blow-up of the distribution of discriminator parameters in ﬁnite time. Still, the measure valued solution is well deﬁned for all times t > 0. Most noticeably, for this solution once the dimension of the support of the measure is reduced, it will never fatten back up. In an extreme case, the dynamics can lead to the distribution of the discriminator parameters being ν(t) = δω(t) for any t > t . In the follow up work Gulrajani et al. (2017), ﬁnite time blow-up was already observed in toy numerical examples. Gulrajani et al. (2017) improves the original W-GAN algorithm by enforcing 1-Lipschitz condition with a penalization. With respect to the underlying energy functional, this is equivalent for the mean ﬁeld dynamics to considering

E[µ, ν] = Z

RL Dν(Gµ(z)) d N(z) Z

RK Dν(x) d P (x)

RK || Dν((1 s)Gµ(z) + sx)| 1|2 d P (x)d N(z)ds,

with λ being a user chosen penalization parameter. The evolution of the mean ﬁeld limit can be formally characterized as the gradient descent of E on µ and gradient ascent on ν. In terms of equations we consider

tµ θ µ θ δE

δµ [µ, ν] = 0,

tν + γc ω ν ω δE

δν [µ, ν] = 0, µ(0) = µin, ν(0) = νin.

Understanding, the diﬀerence in the dynamics for these improved algorithms is an interesting open problem. For the long time behavior of the dynamics (7), we refer to Section 3 for intuition where we show in a toy example of ODEs that for any initial conditions the dynamics stabilize to a limiting periodic orbit. Generalizing this to absolutely continuous initial data is quite complicated, we mention the recent work for the Euler equation, in Hassainia et al. (2023) the authors construct vortex patches that replicate the motion of leapfrogging vortex points. Moreover, for the general system, we expect that the dynamics will always converge to some limiting periodic orbit. Showing this rigorously is a challenging PDE problem. The generator and discriminator functions are a linear mapping of the empirical measure, hence their behavior up to redundancies of over parameterization is completely determined by the behavior of the empirical measure. The periodic behavior observed at the level of the empirical measure is expected to be reﬂected in the generator and discriminator functions, which will also exhibit periodic orbits. This points towards a fatal ﬂaw in the adversarial part of the GAN algorithm, which makes the discriminator and generator play a cat and mouse game forever. In contrast, we note that newer diﬀusion models Sohl-Dickstein et al. (2015), circumvent this problem by transforming the dynamics into a supervised learning problem. In terms of the curse of dimensionality exhibited in Corollary 9, an alternative would be to quantify the convergence of the algorithm in a Reproducing Kernel Hilbert Space (RKHS). In PDE terms, this would mean to show well posedness of the PDE in a negative Sobolev space like H s

with s > d/2.

Acknowledgments

We would like to acknowledge Justin Sirignano, Yao Yao and Federico Camara Halac for useful conversations at the beginning of this project. MGD would like to thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, for support and hospitality during the program Frontiers in Kinetic Theory where part of the work on this paper was undertaken. This work was supported by EPSRC grant no EP/R014604/1. The research of MGD was partially supported by NSF-DMS2205937 and NSF-DMS RTG 1840314. The research of RC was partially supported by NSF-DMS RTG 1840314. We also like to thank the anonymous referees for helping improve the manuscript with their suggestions.

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savar e. Gradient ﬂows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.

Martin Arjovsky, Soumith Chintala, and L eon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214 223. PMLR, 2017.

Yogesh Balaji, Mohammadmahdi Sajedi, Neha Mukund Kalibhat, Mucong Ding, Dominik St oger, Mahdi Soltanolkotabi, and Soheil Feizi. Understanding over-parameterization in generative adversarial networks. In International Conference on Learning Representations.

Andrea L Bertozzi, Thomas Laurent, and Jes us Rosado. Lp theory for the multidimensional aggregation equation. Communications on Pure and Applied Mathematics, 64(1):45 83, 2011.

Fran cois Bolley, Arnaud Guillin, and C edric Villani. Quantitative concentration inequalities for empirical measures on non-compact spaces. Probability Theory and Related Fields, 137:541 593, 2007.

JA Carrillo, M Di Francesco, A Figalli, T Laurent, and C Slepcev. Global-in-time weak measure solutions and ﬁnite-time aggregation for nonlocal interaction equations. Duke Mathematical Journal, 156(2):229 271, 2011.

Jos e A Carrillo, Robert J Mc Cann, and C edric Villani. Contractions in the 2-wasserstein length space and thermalization of granular media. Archive for Rational Mechanics and Analysis, 179: 217 263, 2006.

Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for overparameterized models using optimal transport. Advances in neural information processing systems, 31, 2018.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303 314, 1989.

Guido De Philippis, Alp ar Rich ard M esz aros, Filippo Santambrogio, and Bozhidar Velichkov. Bv estimates in optimal transportation and applications. Archive for Rational Mechanics and Analysis, 219:829 860, 2016.

Simone Di Marino, Bertrand Maury, and Filippo Santambrogio. Measure sweeping processes. Journal of Convex Analysis, 23(2):567 601, 2016.

GAN: Dynamics

Richard M Dudley. Central limit theorems for empirical measures. The Annals of Probability, pages 899 929, 1978.

Xavier Fern andez-Real and Alessio Figalli. The continuous formulation of shallow neural networks as wasserstein-type gradient ﬂows. In Analysis at Large: Dedicated to the Life and Work of Jean Bourgain, pages 29 57. Springer, 2022.

Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707, August 2015. URL https://hal.science/hal-00915365.

Jean-Yves Franceschi, Emmanuel De B ezenac, Ibrahim Ayed, Micka el Chen, Sylvain Lamprier, and Patrick Gallinari. A neural tangent kernel perspective of gans. In International Conference on Machine Learning, pages 6660 6704. PMLR, 2022.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. Advances in neural information processing systems, 30, 2017.

Zineb Hassainia, Taouﬁk Hmidi, and Nader Masmoudi. Rigorous derivation of the leapfrogging motion for planar euler equations. ar Xiv preprint ar Xiv:2311.15765, 2023.

Claude Henry. An existence theorem for a class of diﬀerential equations with multivalued right-hand side. Journal of Mathematical Analysis and Applications, 41(1):179 186, 1973.

Arthur Jacot, Franck Gabriel, and Cl ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. ar Xiv preprint ar Xiv:1710.10196, 2017.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014.

Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of gans. In International Conference on Learning Representations, 2018. URL https://openreview. net/forum?id=ryep FJb A-.

Alex Krizhevsky, Ilya Sutskever, and Geoﬀrey E Hinton. Imagenet classiﬁcation with deep convolutional neural networks. Communications of the ACM, 60(6):84 90, 2017.

Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean ﬁeld view of the landscape of twolayer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665 E7671, 2018.

Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In International conference on machine learning, pages 3481 3490. PMLR, 2018.

Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled Generative Adversarial Networks. In International Conference on Learning Representations, 2016.

Jean Jacques Moreau. Evolution problem associated with a moving convex set in a hilbert space. Journal of diﬀerential equations, 26(3):347 374, 1977.

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.

Grant Rotskoﬀand Eric Vanden-Eijnden. Trainability and accuracy of artiﬁcial neural networks: An interacting particle system approach. Communications on Pure and Applied Mathematics, 75(9):1889 1935, 2022.

Filippo Santambrogio. Crowd motion and evolution pdes under density constraints. ESAIM: Proceedings and Surveys, 64:137 157, 2018.

Justin Sirignano and Konstantinos Spiliopoulos. Mean ﬁeld analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725 752, 2020a.

Justin Sirignano and Konstantinos Spiliopoulos. Mean ﬁeld analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130(3):1820 1852, 2020b.

Justin Sirignano and Konstantinos Spiliopoulos. Mean ﬁeld analysis of deep neural networks. Mathematics of Operations Research, 47(1):120 152, 2022.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256 2265. PMLR, 2015.

Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton. Veegan: Reducing mode collapse in gans using implicit variational learning. Advances in neural information processing systems, 30, 2017.

Hoang Thanh-Tung and Truyen Tran. Catastrophic forgetting and mode collapse in gans. In 2020 international joint conference on neural networks (ijcnn), pages 1 10. IEEE, 2020.

Tijmen Tieleman. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26, 2012.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Stephan Wojtowytsch and Weinan E. Can shallow neural networks beat the curse of dimensionality? a mean ﬁeld training perspective. IEEE Transactions on Artiﬁcial Intelligence, 1(2):121 129, 2020. doi: 10.1109/TAI.2021.3051357.

Zhiming Zhou, Jiadong Liang, Yuxuan Song, Lantao Yu, Hongwei Wang, Weinan Zhang, Yong Yu, and Zhihua Zhang. Lipschitz generative adversarial nets. In International Conference on Machine Learning, pages 7584 7593. PMLR, 2019.

GAN: Dynamics

Appendix A.

Following the ideas of Henry (1973), in this section we prove the existence, uniqueness and stability to a class of ODEs with discontinuous forcing given by a projection. We also show quantitative convergence of the projected forward Euler algorithm, for which we could not ﬁnd a good reference for. Before we present the main result, we introduce some notation that we need. For any closed convex subset Q Rd and x Rd there exists an unique Proj Qx Q such that

Proj Qx x = min q Q q x .

The map Proj Q is non-expansive, which means that for all x, y Rd:

Proj Q(x) Proj Q(y) x y .

We denote by πQ(x) Rd the tangent cone of Q at x Q,

πQ(x) = {v Rd | ϵ > 0, x + ϵv Q} = v Rd lim h 0+ d(x + hv, Q)

which is a closed convex cone. The map ProjπQ(x) : Rd Rd denotes the projection onto πQ(x) Rd. We notice that for a smooth vector ﬁeld V (x) : Q Rd, the mapping x Rn 7 ProjπQ(x)(V (x)) is discontinuous at points x such that V (x) / πQ(x).

Theorem 17 (Henry (1973)) Let Q Rd be a closed and convex subset of Rd and V : Rd Rd

a C1 vector ﬁeld, which satisﬁes that there exists C > 0, such that |V (x)|, | V (x)| C. Then, for any initial condition xin Q there exists a unique absolutely continuous curve x : [0, ) Q such that ( x = ProjπQ(x)V (x),

x(0) = xin, (29)

with the equality satisﬁed for almost every t. Moreover, the solutions are also stable with respect to the initial condition xin:

x1(t) x2(t) e V t x1(0) x2(0) for any t > 0,

where x1 and x2 are two solution to (29).

Moreover, we can approximate these solutions by a projected forward Euler algorithm.

Theorem 18 Let x t : [0, ) Q be the linear interpolation at times n t of {xn t} deﬁned by the projected Euler algorithm ( xn+1 t = Proj Q(xn t + t V (xn t)) x0 t = xin. (30)

Then, for any time horizon T > 0, as t 0 we have

x t x e(1+ V )t 2( t)1/2 V + t V V ,

where x is the unique solution of (29).

Proof [Proof of Theorem 17 and Theorem 18] For each x Q, we deﬁne the normal cone NQ(x) as NQ(x) = {n Rd | q Q : n, q x 0},

or equivalently the set of vectors n Rd such that n, w 0 for all w πQ(x). It follows directly from the projection property the following useful result.

Lemma 19 For any v Rd and x Q, the vector nx = v ProjπQ(x)v is orthogonal to ProjπQ(x)v and nx NQ(x). Conversely, if w πQ(x) is that nx = v w NQ(x) and w, v w = 0, then w = ProjπQ(x)v.

Uniqueness and Stability. Suppose that x1 : [0, T] Q and x2 : [0, T] Q are solutions of (29). Then, d dt 1 2 x1 x2 2 = x1 x2, ProjπQ(x1)V (x1) ProjπQ(x2)V (x2) .

Using the property of the projection we have

x1 x2, ProjπQ(x1)V (x1) ProjπQ(x2)V (x2) x1 x2, V (x1) V (x2)

where we have used Lemma 19 for the ﬁrst inequality, and the Lipschitz propety for the second inequality. Grownwall s inequality applied to x1 x2 2 gives:

x1(t) x2(t) e V t x1(0) x2(0) ,

which shows the uniqueness and stability of solutions with respect to the initial condition. Equivalence with a relaxed problem. Using NQ(x), we now introduce a relaxed problem which we prove is equivalent to the ODE (29). For each x Q we deﬁne the compact convex set V(x) Rd by V(x) = {V (x) nx | nx NQ(x), nx 2 V (x) nx}.

The relaxed problem is ﬁnding an absolutely continuous curve x : [0, T] Q such that ( x(t) V(x(t)) x(0) = xin, (31)

for almost every t [0, T]. To show the equivalence between (29) and (31), we need the following Lemma.

Lemma 20 For all x Q we have V (x) V(x), ProjπQ(x)V (x) V(x) and

V(x) πQ(x) = {ProjπQ(x)V (x)}.

Proof [Proof of Lemma 20] Taking nx = 0 in the deﬁnition of V(x) gives that V (x) V(x). Writing nx = V (x) ProjπQ(x)V (x), we recall from Lemma 19 that nx NQ(x) and

nx, ProjπQ(x)V (x) = 0,

so nx 2 = nx, V (x) and we conclude ProjπQ(x)V (x) V(x). Now note that if V (x) nx πQ(x) with nx NQ(x), then V (x) nx, nx 0 with equality only if V (x) nx = ProjπQ(x)V (x), as we have noted above. So if V (x) nx πQ(x) V(x) then V (x) nx = ProjπQ(x)V (x).

GAN: Dynamics

An absolutely continuous curve x : [0, T] Q is such that

lim h 0+ x(t + h) x(t) x(t) h

for almost every t. Since x(t) Q for all t [0, T],

0 = lim h 0+ d(x(t + h), Q)

h = lim h 0+ d(x(t) + h x(t), Q)

which shows x(t) πQ(x(t)). If we have a solution to the relaxed problem, then the diﬀerential inclusion x(t) V(x(t)) is satisﬁed almost everywhere, therefore we have x(t) = ProjπQ(x)V (x) since by Lemma 20 V(x) πQ(x) = {ProjπQ(x)V (x)}, and we conclude that (31) and (29) are equivalent. Existence. Consider νQ(xn+1 t ) NQ(xn+1 t ) unit vectors and 0 λ 1 such that

xn+1 t = xn t + t V (xn t) t λ (V (xn t) νQ(xn+1 t ))+νQ(xn+1 t ),

which follows directly from the properties of the projection. For each n 0 we consider the discrete velocity

un t = xn+1 t xn t t ,

which we re-write as

un t = V (xn+1 t ) λ(V (xn+1 t ) νQ(xn+1 t ))+νQ(xn+1 t ) + V (xn t) V (xn+1 t ) | {z } I + λ ((V (xn t) νQ(xn+1 t ))+ (V (xn+1 t ) νQ(xn+1 t ))+)νQ(xn+1 t ) | {z } II

We notice the bounds

|I|, |II| V |xn+1 t xn t| t V V .

Therefore, letting B1 denote the unit ball centred at the origin,

un t V(xn+1 t ) + V V t B1.

Hence, for any t > 0 we can conclude that for a.e. t

(x t(t), x t(t)) Graph(V) + t( V B1 V V B1), (32)

Noting that x t is uniformly Lipschitz with constant less that V , we get up to subsequence there exists a Lipschitz function X : [0, ) Q such that x t X uniformly at compact subintervals, by Arzerl a-Ascoli. We conclude using Mazur s Lemma that the derivative of X belongs almost everywhere to the upper limit of the convex hull of the values of x t(t),

X(t) lim sup ϵ 0+ co( x t(t)0< t<ϵ).

Using that V(x(t)) is convex and closed, we conclude

(X(t), X(t)) Graph(V),

which implies that X is a solution to the relaxed problem, and therefore a solution to the original (29). Quantitative Estimate. We diﬀerentiate the distance, between X and x t to obtain

1 2 d dt|X x t|2 = X x t, X x t

X x t, V (X) V (x t) + t V | X x t|

+ t V V |X x t|

(1 + V )|X x t|2 + 2 t V 2 + ( t)2 V 2 V 2 ,

where we have used estimate (32) and the contraction to property. Using Gromwall s inequality and that |X(0) x t(0)|2 = 0 , we obtain

|X x t|2 e2(1+ V )t 2 t V 2 + ( t)2 V 2 V 2 .